Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another.

When multicollinearity exists in a regression model, it is difficult to distinguish the individual effects of the independent variables on the dependent variable. It might not affect the model's predictive accuracy much, but the estimated coefficients, and therefore any interpretation of the individual variables, become less reliable.

We will use the House Sales in King County dataset from Kaggle to demonstrate the effects of multicollinearity. The dataset includes records of houses sold between 2014 and 2015. We will build a model to predict house prices (target variable: price).

| Model | Train | Test | Validation |
|---|---|---|---|
| Linear Regression (5-Fold) | 75.61 | 74.82 | 74.08 |

Table 1: R-squared scores of the Linear Regression model with 5-fold cross-validation.
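As a rough sketch of how the scores in Table 1 could be produced (assuming the Kaggle CSV is saved locally as kc_house_data.csv, an 80/20 train/test split, and 5-fold cross-validation on the training portion; the exact feature set and split used for Table 1 are not stated here):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Load the King County house sales data (file name assumed).
df = pd.read_csv("kc_house_data.csv")

# A few numeric features; the article's exact feature list may differ.
# Note: the Kaggle column is named 'bathrooms', referred to as 'bathroom' in the text.
features = ["sqft_living", "sqft_above", "bathrooms", "grade", "sqft_living15"]
X, y = df[features], df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Train r2:", model.score(X_train, y_train))
print("Test r2:", model.score(X_test, y_test))

# Mean r-squared over 5 cross-validation folds on the training data.
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")
print("Validation r2 (5-fold mean):", cv_scores.mean())
```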

Correlation Analysis

The correlation matrix below shows the pairwise correlations between the variables. Correlation alone does not equal multicollinearity; however, it is worth checking which features should be investigated further.


Figure 1: Pearson correlation matrix of the House Sales in King County dataset
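A heatmap like Figure 1 can be produced with pandas and seaborn; a minimal sketch, reusing the df loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix of the numeric columns.
corr = df.select_dtypes("number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pearson correlation matrix")
plt.show()
```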

Let's take a closer look at a couple of independent variables. We can see high correlation scores between the 'sqft_living' and 'sqft_above' features, and between 'sqft_living' and 'bathroom'.


Figure 2: The correlation between the 'sqft_living' and 'sqft_above' features.


Figure 3: The correlation between the 'sqft_living' and 'bathroom' features.
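Scatter plots like Figures 2 and 3 can be drawn directly from the raw columns; a small sketch:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# sqft_living vs. sqft_above (cf. Figure 2).
axes[0].scatter(df["sqft_above"], df["sqft_living"], s=5, alpha=0.3)
axes[0].set_xlabel("sqft_above")
axes[0].set_ylabel("sqft_living")

# sqft_living vs. bathrooms (cf. Figure 3).
axes[1].scatter(df["bathrooms"], df["sqft_living"], s=5, alpha=0.3)
axes[1].set_xlabel("bathrooms")
axes[1].set_ylabel("sqft_living")

plt.tight_layout()
plt.show()
```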

Now let's calculate the Variance Inflation Factor (VIF), which measures the amount of multicollinearity among the regression variables, to confirm whether two independent variables are highly correlated.

From the correlation matrix, two variables that are worth investigating are 'sqft_living' and 'sqft_above'. For the VIF, we regress each of these variables separately on the remaining independent variables.


Variance Inflation Factor (VIF)

Multicollinearity can be reduced by checking the VIF score for each variable and then eliminating one of the correlated variables or combining them, as sketched below.
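As a purely hypothetical illustration of the "combining" option (not something done in the rest of this article), two highly correlated size features could be merged into a single feature before modelling:

```python
# Hypothetical example: merge two correlated size features into one column.
# Averaging the standardized values is just one option; sums, ratios or PCA
# components are equally valid choices.
z_living = (df["sqft_living"] - df["sqft_living"].mean()) / df["sqft_living"].std()
z_above = (df["sqft_above"] - df["sqft_above"].mean()) / df["sqft_above"].std()
df["size_index"] = (z_living + z_above) / 2
```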

Let's calculate the VIF score for some of the features. Here, we train a regression model for each independent variable against the other independent variables, ignoring the dependent variable ('price' in this case). For example,

sqft_living = a * sqft_above + b * bathroom + c * grade


sqft_above = b * bathroom + c * grade + d * sqft_living
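The VIF of a feature is computed from the R-squared of such an auxiliary regression, VIF = 1 / (1 - R-squared), so a feature that is almost perfectly explained by the others gets a very large VIF. A minimal sketch using statsmodels (feature list assumed, as above):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

features = ["sqft_living", "sqft_above", "bathrooms", "grade", "sqft_living15"]

# add_constant so each auxiliary regression includes an intercept.
X = add_constant(df[features])

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif[vif["feature"] != "const"])
```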


| Features | R-squared (%) | VIF |
|---|---|---|
| sqft_living | 99 | 100 |
| sqft_above | 99 | 100 |
| bathroom | 70.67 | 3.45 |
| grade | 70.13 | 3.35 |
| sqft_living15 | 68.75 | 3.2 |

Table 2: VIF Score

Here we have two features with high VIF scores. As a rule of thumb, a VIF score of more than 5 implies that multicollinearity does exist in our data. To reduce this effect, we will eliminate one of these features. Let's start by dropping the 'sqft_living' feature and check the VIF again, as in the sketch below.
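The same computation after dropping 'sqft_living' (a sketch, reusing the imports above):

```python
# Drop the offending feature and recompute the VIFs for the rest.
reduced = ["sqft_above", "bathrooms", "grade", "sqft_living15"]
X = add_constant(df[reduced])

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif[vif["feature"] != "const"])
```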

| Features | R-squared (%) | VIF |
|---|---|---|
| sqft_living | - | - |
| sqft_above | 80.97 | 5.25 |
| bathroom | 70.67 | 3.45 |
| grade | 70.13 | 3.35 |
| sqft_living15 | 68.75 | 3.2 |

Table 3: VIF Score after dropping the 'sqft_living' feature

Looks good so far: the VIF score of the 'sqft_above' feature has decreased from 100 to 5.25. In general, we would not want to keep a variable with a VIF score higher than 5, but whether to drop it also depends on your business goal, for example if you want to study the model's relationship with that specific variable. Since here we just want to demonstrate the effect of multicollinearity, let's drop it and try again.

| Features | R-squared (%) | VIF |
|---|---|---|
| sqft_living | - | - |
| sqft_above | - | - |
| bathroom | 67.5 | 3.07 |
| grade | 64.88 | 2.85 |
| sqft_living15 | 61.61 | 3.61 |

Table 4: VIF Score after dropping the 'sqft_living' and 'sqft_above' features

Now, we remove both the 'sqft_living' and 'sqft_above' features and train again with Linear Regression and Ridge Regression, each with 5-fold cross-validation.

| Model | Train | Test | Validation |
|---|---|---|---|
| Linear Regression (5-Fold) | 72.82 | 72.99 | 71.58 |
| Ridge (5-Fold) | 72.93 | 73.17 | 70.94 |

Table 5: Summary of r-squared scores trained with Linear Regression and Ridge Regression after dropping variables.
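A sketch of how scores like those in Table 5 could be obtained after dropping both features (the remaining feature set and the Ridge penalty strength are assumptions):

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

reduced = ["bathrooms", "grade", "sqft_living15"]
X, y = df[reduced], df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("Linear Regression", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: train={model.score(X_train, y_train):.4f}, "
          f"test={model.score(X_test, y_test):.4f}, "
          f"validation (5-fold mean)={cv.mean():.4f}")
```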

Table 5 shows the results of the models trained after dropping the aforementioned variables. As mentioned earlier, multicollinearity does not affect the model's predictive accuracy much. If your task is simply to make predictions, you may not need to reduce the effect of multicollinearity. However, if your task is to analyze and understand the role of each independent variable, multicollinearity certainly needs to be investigated.

Summary

We have demonstrated the importance of the relationships between independent variables. There are solutions other than dropping variables, such as combining the correlated features or training with a more advanced model. In any case, it is worth exploring the data for multicollinearity in order to build a more reliable model.


