Dummy Variable Trap and its solution in Python
This article explains the dummy variable trap and how to avoid it. We start with dummy variables themselves.
What is a dummy variable?
A regression model can receive several types of data, which fall into two main categories: numerical and categorical. Regression handles numerical data directly, but categorical data must first be transformed into numeric form. One-hot encoding is commonly used for this in a linear regression model. It creates a new variable for each category, and each variable holds 1 if the category is present and 0 otherwise. For p different categories, p new variables are introduced. These variables are called dummy variables.
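As a minimal sketch of one-hot encoding, the snippet below applies pandas' get_dummies to a toy column. The `color` column and its values are hypothetical and only serve to illustrate the p categories to p dummy variables mapping.

```python
import pandas as pd

# Hypothetical toy column with p = 3 categories
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One dummy column per category, holding 1 where the category is present
dummies = pd.get_dummies(colors["color"], dtype=int)
print(dummies)
```

Each row contains exactly one 1, in the column that matches its category.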
Dummy variable trap
After one-hot encoding, the regression model contains a dummy variable for each category. These variables are highly correlated with each other: the value of any one of them can be predicted exactly from the others, because the dummy variables of a feature always sum to 1. This redundancy (perfect multicollinearity) is called the dummy variable trap. Including all the dummy variables adds redundant data to the model.
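The redundancy can be checked numerically. The sketch below assumes a model with an intercept and uses a hypothetical `city` column; it builds the design matrix with all p dummies and compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with p = 3 categories
city = pd.Series(["Paris", "Rome", "Oslo", "Rome", "Paris"], name="city")

all_dummies = pd.get_dummies(city, dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1
row_sums = all_dummies.sum(axis=1)

# Design matrix: an intercept column of ones followed by all p dummies
design = np.column_stack([np.ones(len(city)), all_dummies.to_numpy()])

# The intercept column equals the sum of the dummy columns,
# so the rank is lower than the number of columns
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

A rank below the column count means the coefficients of a linear regression are not uniquely determined.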
Solution for dummy variable trap
The solution to the dummy variable trap is to drop one of the dummy variables. If there are p categories, then p-1 dummy variables should be used. The dropped category becomes the baseline: it is represented by all remaining dummy variables being 0.
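A short sketch of this fix, using the drop_first option of pandas' get_dummies on a hypothetical `city` column. With p-1 dummies plus an intercept, the design matrix has as many independent columns as it has columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with p = 3 categories
city = pd.Series(["Paris", "Rome", "Oslo", "Rome", "Paris"], name="city")

# drop_first=True removes the dummy of the first category (here "Oslo")
reduced = pd.get_dummies(city, drop_first=True, dtype=int)

# Intercept column plus the remaining p-1 dummies
design = np.column_stack([np.ones(len(city)), reduced.to_numpy()])
rank = np.linalg.matrix_rank(design)
print(list(reduced.columns), "columns:", design.shape[1], "rank:", rank)
```

Rows for "Oslo" are now encoded as 0 in both remaining columns, so no information is lost.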
Python Dummy variable trap and its solution
The following example makes the dummy variable trap easier to understand.
First, import the libraries and load the dataset.
import pandas as pd
data = pd.read_csv('titanic.csv')
data.head()
Next, drop the unnecessary columns and the rows with null values.
data = data.drop(['Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1)
data = data.dropna()
data.head()
Now we have data in numeric and categorical form. Numeric values stay as they are, and categorical values are one-hot encoded. Encoding every category would lead to the dummy variable trap, so the first dummy column of each categorical feature is dropped: Pclass drops the column for 1, Sex drops the column for female, and Embarked drops the column for C.
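The encoding step itself might look like the sketch below. The variable names classes, sex and embarked are taken from the merge step that follows. The helper `encode_categoricals`, the small `sample` frame, and the `Pclass` prefix are assumptions made for illustration. The prefix keeps all column names strings, which recent scikit-learn versions expect.

```python
import pandas as pd

# Hypothetical stand-in rows; in the walkthrough, `data` is the cleaned Titanic frame
sample = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
    "Embarked": ["C", "S", "Q", "S"],
})

def encode_categoricals(df):
    """Return p-1 dummy columns each for Pclass, Sex and Embarked."""
    # The prefix turns the integer class labels into string column names
    classes = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True, dtype=int)
    sex = pd.get_dummies(df["Sex"], drop_first=True, dtype=int)
    embarked = pd.get_dummies(df["Embarked"], drop_first=True, dtype=int)
    return classes, sex, embarked

classes, sex, embarked = encode_categoricals(sample)
print(list(classes.columns), list(sex.columns), list(embarked.columns))
```

On the real dataset, the call would be classes, sex, embarked = encode_categoricals(data).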
Merge all dummy variables with the data and drop the original categorical columns.
data = pd.concat([data, classes, sex, embarked], axis=1)
data = data.drop(['Pclass', 'Sex', 'Embarked'], axis=1)
data.head()
Now the data is ready for a machine learning model to make predictions.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X = data.drop("Survived", axis=1)
y = data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)
In conclusion, we covered the following topics:
- What is a dummy variable?
- Dummy variable trap
- Solution for dummy variable trap
- Dummy variable trap and its solution in Python.