Multicollinearity
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another.
When multicollinearity exists in a regression model, it is difficult to distinguish the individual effects of the independent variables on the dependent variable. It may not affect the model's predictive accuracy much, but the estimated coefficients become less reliable.
We use the House Sales in King County dataset from Kaggle to demonstrate the effects of multicollinearity. The dataset includes records of houses sold between 2014 and 2015. We will build a model to predict house prices (target variable: price).
Model | Train | Test | Validation |
---|---|---|---|
Linear Regression (5-Fold) | 75.61 | 74.82 | 74.08 |
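A baseline like the one above can be sketched as follows. Since the Kaggle CSV is not bundled here, the sketch uses synthetic stand-in features (`sqft_living` and `grade` are real column names from the dataset; the generated values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the King County data: a roughly linear
# price signal plus noise (illustrative values only).
rng = np.random.default_rng(0)
n = 1000
sqft_living = rng.normal(2000, 600, n)
grade = rng.integers(3, 13, n).astype(float)
price = 150 * sqft_living + 20000 * grade + rng.normal(0, 50000, n)
X = np.column_stack([sqft_living, grade])

# Hold out a validation set, then 5-fold cross-validate on the rest.
X_train, X_val, y_train, y_val = train_test_split(
    X, price, test_size=0.2, random_state=0
)
model = LinearRegression()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # R^2 per fold
model.fit(X_train, y_train)
print(f"5-fold CV R^2: {cv_scores.mean():.2f}, "
      f"validation R^2: {model.score(X_val, y_val):.2f}")
```

On the real data, the same pattern (hold-out validation plus 5-fold cross-validation on the remainder) produces the train/test/validation scores reported in the table.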
Correlation Analysis
The example above shows correlations between variables. Correlation alone does not equal multicollinearity, but it indicates which features are worth investigating further.
Let's take a look at a couple of independent variables. The correlation matrix shows high correlation scores between the features 'sqft_living' and 'sqft_above', so these two variables are worth investigating. We build a model on these two variables to examine them separately.
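Inspecting correlations can be done with a plain correlation matrix. Below is a minimal sketch using synthetic stand-in data (the column values are hypothetical; with the real dataset you would pass the actual feature columns):

```python
import numpy as np

# Synthetic stand-in features: sqft_above is built mostly from
# sqft_living, so the two should be strongly correlated.
rng = np.random.default_rng(42)
n = 300
sqft_living = rng.normal(2000, 500, n)
sqft_above = 0.85 * sqft_living + rng.normal(0, 100, n)
bathrooms = rng.normal(2.0, 0.5, n)

X = np.column_stack([sqft_living, sqft_above, bathrooms])
corr = np.corrcoef(X, rowvar=False)  # 3x3 correlation matrix
print(corr.round(2))
```

With a pandas DataFrame, `df.corr()` gives the same matrix with labeled rows and columns, which is handy for scanning many features at once.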
Variance Inflation Factor (VIF)
Multicollinearity can be reduced by eliminating one of the correlated variables or by combining them; checking the VIF score for each variable tells us which ones to address.
Let's calculate the VIF score for some of the features. Here, we regress each independent variable on the remaining independent variables, ignoring the dependent variable (price in this case). The VIF for a variable is then 1 / (1 − R²) of that auxiliary regression.
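This auxiliary-regression procedure can be sketched in pure NumPy (the `vif` helper below is our own illustration; libraries such as statsmodels also ship a VIF function):

```python
import numpy as np

def vif(X):
    """Return the VIF for each column of X (rows = observations).

    Each feature is regressed on the remaining features with an
    intercept; its VIF is 1 / (1 - R^2) of that auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)
```

For example, a feature that is almost a copy of another will have a huge VIF, while an independent feature's VIF stays near 1.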
Features | R² (%) | VIF
---|---|---
sqft_living | 99 | 100 |
sqft_above | 99 | 100 |
bathroom | 70.67 | 3.45 |
grade | 70.13 | 3.35 |
sqft_living15 | 68.75 | 3.2 |
Here we have two features with high VIF scores. As a rule of thumb, a VIF score above 5 implies that multicollinearity exists in the data. To reduce this effect, we eliminate one of these features; let's start by dropping 'sqft_living'. Now, let's check the VIF again!
Features | R² (%) | VIF
---|---|---
sqft_living | - | - |
sqft_above | 80.97 | 5.25 |
bathroom | 70.67 | 3.45 |
grade | 70.13 | 3.35 |
sqft_living15 | 68.75 | 3.2 |
Looks good so far: the VIF score of the 'sqft_above' feature decreased from 100 to 5.25. In general, we would not want to keep a variable with a VIF score higher than 5, but this also depends on your business goal, e.g., whether you need to interpret the model's relationship with that specific variable. Since here we just want to demonstrate the effect of multicollinearity, let's drop it and try again.
Features | R² (%) | VIF
---|---|---
sqft_living | - | - |
sqft_above | - | - |
bathroom | 67.5 | 3.07 |
grade | 64.88 | 2.85 |
sqft_living15 | 61.61 | 3.61 |
Now, with both the 'sqft_living' and 'sqft_above' features removed, we train LinearRegression and Ridge again with 5-fold cross-validation.
Model | Train | Test | Validation |
---|---|---|---|
Linear Regression (5 Fold) | 72.82 | 72.99 | 71.58 |
Ridge (5 Fold) | 72.93 | 73.17 | 70.94 |
Table 5 shows the results of the trained model after dropping the aforementioned variables. As mentioned earlier, multicollinearity barely affects the model's predictive accuracy. If your task is purely prediction, you may not need to reduce the effect of multicollinearity. However, if your task is to analyze and understand the role of each independent variable, multicollinearity certainly needs to be investigated.
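The prediction-vs-interpretation trade-off can be illustrated with a small synthetic experiment (our own sketch, not the King County data): a near-duplicate feature leaves accuracy intact but makes the fitted coefficient jump around across samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_x1_coef(seed, drop_collinear):
    """Fit y ~ x1 (and optionally a near-copy x2); return x1's coefficient."""
    rng = np.random.default_rng(seed)
    n = 500
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)  # nearly a duplicate of x1
    y = 3 * x1 + rng.normal(size=n)      # x2 adds no real information
    X = np.column_stack([x1] if drop_collinear else [x1, x2])
    return LinearRegression().fit(X, y).coef_[0]

coef_full = [fit_x1_coef(s, drop_collinear=False) for s in range(30)]
coef_drop = [fit_x1_coef(s, drop_collinear=True) for s in range(30)]
print(f"x1 coefficient spread with the collinear copy: {np.std(coef_full):.3f}")
print(f"x1 coefficient spread after dropping it:       {np.std(coef_drop):.3f}")
```

The spread of the coefficient across resampled datasets shrinks dramatically once the collinear copy is dropped, which is exactly why interpretation-focused analyses should address multicollinearity even when predictive scores look fine.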
Summary
We have demonstrated the importance of examining the relationships between independent variables. Besides dropping variables, there are other solutions, such as combining correlated features or training with a more advanced model. In any case, it is worth exploring the data for multicollinearity in order to build a more reliable model.