-
Loaded the data into a dataframe df
-
The dataset has 155 rows and 20 columns
-
The target column, "Response Variable", is continuous, so this is a regression task rather than a classification task.
-
The data has no missing values.
-
I drop the date column: it has 155 unique values (one per row), so it carries no predictive power.
-
I split the data into X & Y, where X contains the features and Y contains the target variable "Response Variable".
-
After splitting the data into X & Y, I split it into train and test sets, using 70% of the data for training and 30% for testing.
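Below is a minimal sketch of these setup steps. The file name "data.csv" and the column names "Date" and "Response Variable" are assumptions; substitute the actual names from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data and confirm its shape and missing-value count.
df = pd.read_csv("data.csv")          # hypothetical file name
print(df.shape)                       # expected: (155, 20)
print(df.isnull().sum().sum())        # expected: 0

# The date column is unique for every row, so it carries no predictive power.
df = df.drop(columns=["Date"])        # hypothetical column name

# Separate features and target.
X = df.drop(columns=["Response Variable"])
y = df["Response Variable"]

# 70/30 train-test split; random_state fixed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```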
-
Now, I start by analysing the target variable distribution.
- The target variable is positively skewed, with a mean close to 891, so a transformation is needed to bring it closer to a normal distribution.
- I apply a log10 transformation three times, which brings the distribution near normal with a skew of about 0.25.
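A sketch of the transformation step, assuming the target values are large enough that each successive log10 stays positive before the next one is applied:

```python
import numpy as np

print(y_train.skew())                 # strongly positive before transforming

# Apply log10 three times, as described above; the skew should drop to roughly 0.25.
y_train_t = np.log10(np.log10(np.log10(y_train)))
y_test_t = np.log10(np.log10(np.log10(y_test)))
print(y_train_t.skew())
```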
-
Next, I analyse the relationship of each feature with the response variable using Seaborn's jointplot (a sketch follows this list), and observe the following:
- Features 1 & 2 have a slightly significant negative relationship with the response variable.
- Feature 4 has no relationship with the response variable.
- Features 3, 5, 6, 7, 8, 9, 10, 11, 12, 13 & 17 have a highly significant positive relationship with the response variable.
- Features 14 & 18 have a highly significant negative relationship with the response variable.
- Features 15 & 16 have a slightly significant positive relationship with the response variable.
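A sketch of this visual check, looping a seaborn regression jointplot over the feature columns (the feature names above are placeholders for the actual columns):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One regression jointplot per feature against the transformed target.
for col in X_train.columns:
    sns.jointplot(x=X_train[col], y=y_train_t, kind="reg")
    plt.show()
```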
-
Next, I plot a correlation heatmap (sketched after this list) to check how strongly each feature is correlated with the target variable and to look for multicollinearity among the features.
- Features 9, 11, 10, 17, 13, 12 & 6 are heavily correlated with the target variable.
- Features 9, 11 & 10 exhibit multicollinearity.
- Features 12 & 6 are also heavily correlated with each other.
- There are many more such collinear combinations in the data.
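A sketch of the heatmap, computed on the training features plus the transformed target:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations between all features and the target.
corr = pd.concat([X_train, y_train_t.rename("Response Variable")], axis=1).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```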
-
I now compute the Variance Inflation Factor (VIF) for each feature to check whether it can be explained by the other features. A common rule of thumb is that a VIF > 5 indicates high multicollinearity, so I iteratively eliminate the features with very high VIF scores and am left with Features 1, 4, 5, 13 and 15 (see the sketch below).
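A sketch of the iterative elimination, using statsmodels' variance_inflation_factor and the VIF > 5 rule of thumb (the helper name drop_high_vif is mine, not from the original code):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=5.0):
    # Repeatedly drop the feature with the highest VIF until all VIFs <= threshold.
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X
        print(f"Dropping '{worst}' with VIF value : {vifs[worst]}")
        X = X.drop(columns=[worst])

X_train_vif = drop_high_vif(X_train)   # expected to keep Features 1, 4, 5, 13 and 15
```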
-
I plot the heatmap again to confirm the multicollinearity is gone, and find that Feature 13 and Feature 5 are still heavily correlated.
-
Since Feature 13 is more strongly correlated with the Response Variable, I drop Feature 5 from the training dataset.
-
I use the OLS class from the statsmodels package.
- I start with statsmodels' OLS to fit X & Y because its summary output reports the significance of each predictor variable (a sketch of the fit and diagnostics follows this list).
- The F-test of overall significance indicates whether the linear regression model fits the data better than a model with no independent variables. In our case, based on the p-value of the F-test (Prob (F-statistic)), we conclude that the model is a good fit.
- An R-squared value of 0.69 indicates a decent fit.
- The P>|t| column, i.e. the p-value for each feature, indicates how significant that feature is.
- In linear regression, the null hypothesis for a feature and the target variable is that there is no relationship between them.
- The alternative hypothesis states that there is a relationship between the target and the feature.
- P-values below 0.05 indicate that Features 1, 4 and 13 are significant predictors related to the target variable.
- However, Feature 15 has a p-value of 0.064, which makes it statistically insignificant at the 0.05 level.
- On the testing dataset, the RMSE is very small (0.000244, on the transformed target scale), which is characteristic of a good model.
- The R-squared score on the test set is 0.635, which is decently good given that this is a simple model without regularization, trained on a limited dataset.
- I also use the model residuals to validate the regression assumptions and the quality of the OLS model.
- A scatter plot of residuals against predicted values looks reasonably random.
- The probability plot shows a good match between observed and expected quantiles, indicating that normality of the residuals is a reasonable approximation.
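A sketch of the OLS fit, test-set evaluation, and residual diagnostics. The names X_train_sel / X_test_sel stand in for the reduced feature set (Features 1, 4, 13 and 15) and y_train_t / y_test_t for the transformed target:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score

# statsmodels' OLS does not add an intercept automatically.
X_train_const = sm.add_constant(X_train_sel)
X_test_const = sm.add_constant(X_test_sel)

ols_model = sm.OLS(y_train_t, X_train_const).fit()
print(ols_model.summary())            # R-squared, Prob (F-statistic), per-feature P>|t|

# Evaluate on the held-out 30%.
y_pred = ols_model.predict(X_test_const)
rmse = np.sqrt(mean_squared_error(y_test_t, y_pred))
print("RMSE:", rmse, "R-squared:", r2_score(y_test_t, y_pred))

# Residuals vs. predicted values should look random if the linear model is adequate.
residuals = y_test_t - y_pred
plt.scatter(y_pred, residuals)
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()

# Probability plot: points close to the reference line suggest near-normal residuals.
stats.probplot(residuals, plot=plt)
plt.show()
```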
-
I also use Ridge Regression to create a model (a sketch follows this list).
- Ridge regression adds a penalty on the coefficients to reduce model complexity and prevent the over-fitting that plain linear regression can suffer from.
- I first find the best value of alpha (the penalty term) with grid search and cross-validation, using neg_mean_squared_error as the scoring criterion.
- Based on the data and scoring criteria, the best value of alpha comes out to be 20.
- I fit the data on the best estimator from GridSearch.
- On the testing dataset, the RMSE is similar to that of the OLS model.
- The R-squared score on the test set comes out slightly better than the OLS model's.
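A sketch of the Ridge step: grid search over alpha with neg_mean_squared_error scoring. The alpha grid here is an assumption; the text reports 20 as the value selected for this data:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search a range of penalty strengths with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1, 5, 10, 20, 50, 100]}
grid = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train_sel, y_train_t)
print(grid.best_params_)              # reported above as alpha = 20

# Evaluate the best estimator on the test set.
ridge_best = grid.best_estimator_
print(ridge_best.score(X_test_sel, y_test_t))   # R-squared on the test data
```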
- Yes, I removed a number of columns because of the high multicollinearity they exhibited. Here are the columns dropped and their VIF values:
- Dropping 'Feature 11' with VIF value : 3942.998747788726
- Dropping 'Feature 6' with VIF value : 1091.6086219503093
- Dropping 'Feature 16' with VIF value : 974.4334324017261
- Dropping 'Feature 12' with VIF value : 300.6222736376661
- Dropping 'Feature 10' with VIF value : 202.678484487964
- Dropping 'Feature 7' with VIF value : 141.40042641429838
- Dropping 'Feature 9' with VIF value : 106.76674138525257
- Dropping 'Feature 3' with VIF value : 86.02264209663117
- Dropping 'Feature 8' with VIF value : 67.31897419929773
- Dropping 'Feature 14' with VIF value : 58.7676190989684
- Dropping 'Feature 17' with VIF value : 13.797398636780125
- Dropping 'Feature 18' with VIF value : 8.976374644362185
- Dropping 'Feature 2' with VIF value : 5.840934708316343
- Since the target variable is continuous, the given problem is a regression problem rather than a classification problem.
- The metric used for linear regression here is Root Mean Square Error (RMSE), which is essentially the standard deviation of the residuals (prediction errors). Residuals measure how far the data points fall from the regression line; an RMSE closer to 0 means the model fits well (see the formula below).
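For reference, the standard definition of RMSE over n predictions, with y_i the observed value and ŷ_i the prediction:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$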
- A lot of features were heavily correlated with the target variable.
- However, those features were correlated with other features as well.
- There was a lot of multicollinearity in the dataset which, if not addressed, would have produced unreliable coefficient estimates for the features.