Forward Elimination in Machine Learning – Python

In this tutorial, we are going to learn the forward elimination method (also called forward stepwise selection) in machine learning with Python. First we will see what it is, and then we will see how to implement it in Python.

Forward Elimination

Instead of including all the predictors in the model, we can keep only the most significant variables (predictors). Forward selection starts with an empty model and, at each step, adds the predictor that improves the fit the most, until the desired number of features is reached. By leaving out irrelevant features, we obtain a model that is more easily interpreted.

Ref: https://web.stanford.edu/~hastie/MOOC-Slides/model_selection.pdf
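To make the procedure concrete, here is a minimal sketch of the greedy loop behind forward selection, assuming X is a pandas DataFrame of predictors; forward_select and r2_on_train are hypothetical helper names, not part of any library.

# Hypothetical sketch of the greedy loop behind forward selection
# (forward_select and r2_on_train are illustrative names, not library code).
from sklearn.linear_model import LinearRegression

def r2_on_train(X_subset, y):
    # Training-set R^2, mirroring the cv=0 setting used later in this tutorial
    return LinearRegression().fit(X_subset, y).score(X_subset, y)

def forward_select(X, y, k, score=r2_on_train):
    selected = []                      # start from an empty model
    remaining = list(X.columns)
    while len(selected) < k and remaining:
        # add the feature that improves the score the most
        best = max(remaining, key=lambda f: score(X[selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
    return selected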

Implementation

For this implementation, we are using a GDP dataset downloaded as a CSV file.

First, install the modules with 'pip install scikit-learn' and 'pip install mlxtend', then import them.

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

mlxtend is a package that provides built-in implementations of feature selection techniques.

Next, we load our dataset.

data = pd.read_csv(r"C:\Users\monis\Desktop\dataset.csv")  # adjust the path to your file
data.head()
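Since the predictors are selected by column position in the next step, it helps to confirm the column order first; these are standard pandas calls:

# Check column names and positions before slicing by index
print(data.columns.tolist())
print(data.shape)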

In this data, GDP_growth rate is the response and the remaining columns are predictors. Next, we separate the predictors from the response.

x = data[data.columns[2:8]]   # predictor columns
y = data[data.columns[1]]     # response: GDP growth rate

Now we can select the features based on how well a linear regression model fits, measured by the r2 score.

sfs = SFS(LinearRegression(), k_features=5, forward=True, floating=False, scoring='r2', cv=0)

Arguments:

  • LinearRegression() is the estimator used during the selection process.
  • k_features is the number of features to be selected.
  • For forward elimination (forward selection), we set forward=True and floating=False.
  • The scoring argument sets the evaluation criterion; for regression problems, r2 is the default score in mlxtend's implementation.
  • The cv argument sets the number of folds for K-fold cross-validation; cv=0 means the score is computed on the training data itself (a cross-validated variant is sketched just after this list).
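For reference, the same call with held-out evaluation instead of training-set scoring only changes the cv argument; sfs_cv is an illustrative name.

# Same search, but scored with 5-fold cross-validation instead of cv=0
sfs_cv = SFS(LinearRegression(),
             k_features=5,
             forward=True,
             floating=False,
             scoring='r2',
             cv=5)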

Then we fit the selector to the data.

sfs.fit(x, y)

Output:

SequentialFeatureSelector(clone_estimator=True, cv=0,
                          estimator=LinearRegression(copy_X=True,
                                                     fit_intercept=True,
                                                     n_jobs=None,
                                                     normalize=False),
                          fixed_features=None, floating=False, forward=True,
                          k_features=5, n_jobs=1, pre_dispatch='2*n_jobs',
                          scoring='r2', verbose=0)

Now we can see which 5 features the selector picked for the model.

sfs.k_feature_names_

Output:

('Agri-cultureAlliedServicesGrowthRate',
 'Agriculture-%GrowthRate',
 'Industry-%GrowthRate',
 'Mining_Quarrying-%GrowthRate',
 'Services-%GrowthRate')
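mlxtend also exposes the positions of the selected columns through the k_feature_idx_ attribute, which is useful when x is a NumPy array rather than a DataFrame:

# Column positions of the selected features within x
print(sfs.k_feature_idx_)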

Now, let us see the r2 score for these 5 best features.

sfs.k_score_

Output:

0.9678419438379969
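The selector also records the score reached at every step of the search in sfs.subsets_, so we can see how much each added feature contributed:

# avg_score and chosen column indices after each step of the forward search
for size, info in sfs.subsets_.items():
    print(size, round(info['avg_score'], 4), info['feature_idx'])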

Now let us try with the three best features.

sfs = SFS(LinearRegression(), k_features=3, forward=True, floating=False, scoring='r2', cv=0)
sfs.fit(x, y)
sfs.k_feature_names_
sfs.k_score_

Output:

('Agri-cultureAlliedServicesGrowthRate',
 'Industry-%GrowthRate',
 'Services-%GrowthRate')
0.9656863448203433

The previous run gave an r2 score of 96.78% and this one gives 96.57%. Keep in mind that with cv=0 the score is computed on the same data the model was trained on, and forward selection produces nested feature sets, so adding more features can never decrease this training-set score; it does not follow that the larger model will generalize better.
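A fairer comparison between the 3-feature and 5-feature models scores each subset on held-out folds. Below is a sketch that re-runs the search with 5-fold cross-validation; sfs_k is an illustrative name.

# Re-run the forward search with 5-fold CV and compare subset sizes
for k in (3, 5):
    sfs_k = SFS(LinearRegression(), k_features=k, forward=True,
                floating=False, scoring='r2', cv=5)
    sfs_k.fit(x, y)
    print(k, 'features -> mean CV r2:', round(sfs_k.k_score_, 4))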

