Dummy Variable Trap and its solution in Python

Here, we discuss the dummy variable trap and its solution. But first, let us look at dummy variables.

What is a dummy variable?

The data used in a regression model falls into two main categories: numerical and categorical. A regression model handles numerical data easily, but it cannot use categorical data directly, so categorical data must first be transformed into numeric data. In a linear regression model, one-hot encoding is commonly used for this. One-hot encoding creates a new variable for each category. Each variable contains only 1 or 0: 1 if the category is present, 0 otherwise. For p different categories, p new variables are introduced. These variables are called dummy variables.
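
As a minimal sketch of one-hot encoding, the snippet below encodes a small made-up column with pandas. The `colors` series and its values are hypothetical and are not part of the dataset used later in this article.

```python
import pandas as pd

# Hypothetical toy column with p = 3 categories (not from the Titanic data).
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot encode: one new 0/1 column per category, so p = 3 dummy variables.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column of the category that row belongs to.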

Dummy variable trap

After one-hot encoding, the regression model contains a dummy variable for every category of the categorical data. These variables are highly correlated with each other: the value of any one of them can be predicted exactly from the others, because exactly one dummy is 1 in each row. In a regression model, this perfect multicollinearity is called the dummy variable trap. Including all of the dummy variables results in redundant data.
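
The redundancy can be checked numerically. The sketch below is only an illustration on a hypothetical `city` column: it builds the matrix a regression with an intercept would use and compares its number of columns with its rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with p = 3 categories.
city = pd.Series(["A", "B", "C", "A", "B", "C"], name="city")
dummies = pd.get_dummies(city, dtype=float)

# The p dummy columns add up to 1 in every row ...
row_sums = dummies.sum(axis=1)

# ... so together they duplicate the intercept column of the model.
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)  # 4 columns, but rank 3
```

A rank lower than the number of columns means one column carries no new information, which is exactly the trap described above.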

Solution for dummy variable trap

The solution to the dummy variable trap is to drop one of the dummy variables. If there are p categories, then p-1 dummy variables should be used. The dropped category becomes the baseline, which is represented by all of the remaining dummy variables being 0.
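
A small sketch of the effect of dropping one dummy variable, again on a hypothetical `city` column with three categories: with `drop_first=True`, pandas keeps p-1 columns, and the matrix including the intercept has full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with p = 3 categories.
city = pd.Series(["A", "B", "C", "A", "B", "C"], name="city")

# Keep only p-1 = 2 dummy columns; category "A" becomes the baseline.
reduced = pd.get_dummies(city, drop_first=True, dtype=float)

# Intercept plus the two remaining dummies: no column is redundant now.
design = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank = np.linalg.matrix_rank(design)
print(list(reduced.columns), design.shape[1], rank)
```

Rows where both remaining dummies are 0 belong to the dropped category, so no information is lost.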

Python Dummy variable trap and its solution

The following example makes the dummy variable trap easy to understand.

First, import the libraries and load the dataset.

import pandas as pd
data=pd.read_csv('titanic.csv')
data.head()

Next, drop the unnecessary columns and the rows with null values.

data=data.drop(['Name','PassengerId','Ticket','Cabin'],axis=1)
data=data.dropna()
data.head()

Now the data contains both numeric and categorical columns. Numeric values stay as they are, and categorical values are one-hot encoded. Encoding every category would lead to the dummy variable trap, so we drop the first dummy column of each variable with drop_first=True. As a result, Pclass drops the column for 1, Sex drops the column for female, and Embarked drops the column for C.
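
Which column counts as "first" depends on the category order. The sketch below uses a hypothetical `Sex` column to show the default behaviour and one possible way to pick a different baseline by reordering the categories; this alternative is an assumption for illustration and is not used in the rest of the article.

```python
import pandas as pd

# Hypothetical toy column mirroring the Sex column of the dataset.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Default: categories are sorted, so "female" comes first and is dropped.
default = pd.get_dummies(sex, drop_first=True, dtype=int)

# Assumed alternative: put "male" first so that it becomes the baseline.
ordered = sex.astype(pd.CategoricalDtype(categories=["male", "female"]))
custom = pd.get_dummies(ordered, drop_first=True, dtype=int)

print(list(default.columns))  # ['male']
print(list(custom.columns))   # ['female']
```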

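# prefix gives string column names (Pclass_2, Pclass_3); recent scikit-learn
# versions reject a mix of integer and string column names when fitting
classes=pd.get_dummies(data['Pclass'],prefix='Pclass',drop_first=True)
classes.head()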

sex=pd.get_dummies(data['Sex'],drop_first=True)
sex.head()

embarked=pd.get_dummies(data['Embarked'],drop_first=True)
embarked.head()

Next, merge all the dummy variables with the data and drop the original categorical columns.

data=pd.concat([data,classes,sex,embarked],axis=1)
data=data.drop(['Pclass','Sex','Embarked'],axis=1)
data.head()
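
As an alternative sketch, `pd.get_dummies` can also encode several columns of a DataFrame in one call through its `columns` parameter. The miniature frame `df` below is hypothetical and only reuses the column names of the Titanic data.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the dataset.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Encode the three categorical columns at once, dropping the first dummy of each.
encoded = pd.get_dummies(
    df, columns=["Pclass", "Sex", "Embarked"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
# ['Survived', 'Pclass_2', 'Pclass_3', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

In this form the original categorical columns are replaced automatically, and each dummy column is prefixed with the name of the column it came from.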

Now the data is ready to be used in a machine learning model for prediction.

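from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Features are every column except the target
X=data.drop("Survived",axis=1)
y=data["Survived"]
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# A larger max_iter gives the solver more room to converge on unscaled features
logreg=LogisticRegression(max_iter=1000)
logreg.fit(X_train,y_train)
predictions = logreg.predict(X_test)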

Conclusion

In conclusion, we covered the following topics:

  • What is a dummy variable?
  • Dummy variable trap
  • Solution for dummy variable trap
  • Dummy variable trap and its solution in Python.
