OneHotEncoder() function of sklearn – usage with example

In this tutorial, I will guide you through what One Hot Encoder is, when it is used, and how to use it. One Hot Encoding is part of data preprocessing, a crucial stage in machine learning.

One Hot Encoder

When working with real-life datasets, you will often encounter columns that contain categorical values alongside the numerical ones. Machine learning models, however, need the data to be numeric; they either fail to fit on raw text categories or produce poor results. Categorical data therefore has to be converted into numerical data, and one way of doing this is OneHotEncoder() from the sklearn library. It is easy to understand and use.

The advantage of using it is that you feed more information to the machine learning model, since the categorical columns are now used as well. The disadvantage is that the dimensionality of the data grows (one new column per category), which can increase the risk of overfitting.
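Before the full walkthrough, here is a minimal sketch of the encoder on a made-up column of colour labels (the data and variable names are only for illustration; sparse_output=False is the keyword argument in scikit-learn 1.2 and newer, while older releases call it sparse=False):

from sklearn.preprocessing import OneHotEncoder

# a tiny made-up categorical column (the encoder expects 2-D input: rows x columns)
colours = [['red'], ['green'], ['blue'], ['green']]

encoder = OneHotEncoder(sparse_output=False)  # return a dense array for readability
encoded = encoder.fit_transform(colours)

print(encoder.categories_)  # categories found: 'blue', 'green', 'red'
print(encoded)              # one 0/1 column per category, one row per sample

Each distinct category gets its own column, and exactly one of those columns is 1 in every row, which is where the name "one hot" comes from.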

The step-by-step example below shows how to use it on a full dataset.

Step 1: Import Libraries

First, import all the necessary libraries. NumPy is included because it is used later in Step 4.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Step 2: Loading the dataset

I have made a sample dataset, which is loaded here. We will apply One Hot Encoding to this dataset.

dataset = pd.read_csv('data/Data.csv')
dataset

Have a look at the dataset:

(Dataset preview: ten rows with Country, Age, Salary, and Purchased columns; one Age value and one Salary value are missing.)

Step 3: Extracting the X and y in the dataset

First, note that the independent variables are the Country, Age, and Salary columns, while the dependent variable is the Purchased column. So X will contain every column up to Salary, and y will contain the Purchased column.

X = dataset.iloc[:, :-1].values   # all columns except the last (Country, Age, Salary)
y = dataset.iloc[:, -1].values    # the last column (Purchased)

print(X)
print(y)

Output:

[['France' 44.0 72000.0]  
['Spain' 27.0 48000.0]  
['Germany' 30.0 54000.0]  
['Spain' 38.0 61000.0]  
['Germany' 40.0 nan]  
['France' 35.0 58000.0]  
['Spain' nan 52000.0]  
['France' 48.0 79000.0]  
['Germany' 50.0 83000.0]  
['France' 37.0 67000.0]] 
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
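As a side note, the same split can also be written with column names instead of positions; a small sketch assuming the column labels shown above:

X = dataset.drop(columns=['Purchased']).values   # Country, Age, Salary
y = dataset['Purchased'].values                  # the target column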

Step 4: One Hot Encoding

Our independent variables include categorical data in the Country column, while the other columns hold numerical data. Therefore, we need to convert the Country column into numerical data.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (Country); pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

Using ColumnTransformer to apply OneHotEncoder is advised because it encodes only the columns you specify (here column 0, the Country column) and passes the remaining columns through unchanged.

Output:

[[1.0 0.0 0.0 44.0 72000.0]  
[0.0 0.0 1.0 27.0 48000.0]  
[0.0 1.0 0.0 30.0 54000.0]  
[0.0 0.0 1.0 38.0 61000.0]  
[0.0 1.0 0.0 40.0 nan]  
[1.0 0.0 0.0 35.0 58000.0]  
[0.0 0.0 1.0 nan 52000.0]  
[1.0 0.0 0.0 48.0 79000.0]  
[0.0 1.0 0.0 50.0 83000.0]  
[1.0 0.0 0.0 37.0 67000.0]]
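
Once the ColumnTransformer has been fitted, it can be reused on new, unseen rows, and you can inspect which column each position corresponds to. A short sketch (the new row is made up; get_feature_names_out requires scikit-learn 1.0 or newer):

# transform a new row with the already-fitted transformer
new_rows = [['Germany', 29.0, 50000.0]]
print(ct.transform(new_rows))   # Germany encodes as 0, 1, 0 (categories are sorted: France, Germany, Spain)

# names of the columns produced by the encoder plus the passthrough columns
print(ct.get_feature_names_out())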

We have successfully converted our categorical data into numerical data. This increased the dimensionality of our data (the single Country column became three 0/1 columns), but now we have more information to feed into our machine learning model.
