Forward Elimination in Machine Learning – Python
In this tutorial, we are going to learn the forward elimination method (also called forward stepwise selection) in machine learning, in Python. First we will see what it is, and then we will implement it in Python.
Instead of including all the predictors in the model, we can remove the least significant variables (predictors) before applying the model. This improves the model's interpretability: by removing irrelevant features, we obtain a model that is more easily interpreted.
For this implementation, we are using a GDP dataset, downloaded as a CSV file (GDP dataset CSV file).
First, install the modules with ‘pip install scikit-learn‘ and ‘pip install mlxtend‘, then import them.
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
The mlxtend package provides built-in functions for feature selection techniques.
Next, we load our dataset.
In this data, the GDP_growth rate is the response and the others are predictors. Now we separate the predictors from the response.
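As a minimal sketch of the load-and-split step (the filename GDP.csv, the column names, and the stand-in values below are assumptions, not the real dataset):

```python
import pandas as pd

# In the tutorial the data comes from a CSV, e.g.:
# data = pd.read_csv('GDP.csv')   # filename is an assumption
# Here we build a tiny stand-in frame with columns of the same kind.
data = pd.DataFrame({
    'Agriculture-%GrowthRate': [3.1, 2.4, 4.0, 1.8],
    'Industry-%GrowthRate':    [5.2, 6.1, 4.8, 5.9],
    'Services-%GrowthRate':    [7.0, 6.5, 7.3, 6.9],
    'GDP_growth':              [6.0, 5.8, 6.4, 5.9],
})

# Separate the predictors (x) from the response (y).
x = data.drop(columns=['GDP_growth'])
y = data['GDP_growth']
```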
Now we can select the features based on the accuracy of the linear regression model fit.
sfs = SFS(LinearRegression(), k_features=5, forward=True, floating=False, scoring='r2', cv=0)
- LinearRegression() is the estimator used in the process.
- k_features is the number of features to be selected.
- For forward elimination, we use forward=True and floating=False.
- The scoring argument sets the evaluation criterion. For regression problems, only the r2 score is available in the default implementation.
- The cv argument sets K-fold cross-validation; cv=0 disables it.
Then we fit the selector to the data with sfs.fit(x, y), which returns:
SequentialFeatureSelector(clone_estimator=True, cv=0, estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), fixed_features=None, floating=False, forward=True, k_features=5, n_jobs=1, pre_dispatch='2*n_jobs', scoring='r2', verbose=0)
Now we can see, via sfs.k_feature_names_, which 5 features show a significant change in the model.
('Agri-cultureAlliedServicesGrowthRate', 'Agriculture-%GrowthRate', 'Industry-%GrowthRate', 'Mining_Quarrying-%GrowthRate', 'Services-%GrowthRate')
Now, let us check the score for these 5 best features with sfs.k_score_.
Now let us try for the three best features.
sfs = SFS(LinearRegression(), k_features=3, forward=True, floating=False, scoring='r2', cv=0)
sfs.fit(x, y)
sfs.k_feature_names_
sfs.k_score_
('Agri-cultureAlliedServicesGrowthRate', 'Industry-%GrowthRate', 'Services-%GrowthRate')
The previous selection scored 96.78%, and now we get 96.56%. Since cv=0 evaluates the r2 score on the training data itself, adding more features can only increase the score, which is what we observe here.