Label encoding of datasets in Python

Hey guys, in this tutorial we will learn about label encoding of datasets in Python. When we import a dataset for a machine learning task, it often contains categorical variables, and their values are usually words. Since most machine learning models can only process numerical data, these variables need to be converted into numeric labels. As a preprocessing step, we use label encoding for this task. Let’s understand this in detail.

Label Encoding of datasets

Let’s say we have a dataset with a column that contains the values good, average and bad. We preprocess this data and encode the column so that good, average and bad are replaced with 0, 1 and 2 respectively. Since the new values assigned are labels, we call this method label encoding. This is an important preprocessing step in supervised learning.
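
To make the idea concrete, here is a minimal sketch of that mapping done by hand with pandas. The column name rating and the sample values are made up for illustration; they are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical column of ratings (illustrative data only)
ratings = pd.Series(['good', 'average', 'bad', 'good'], name='rating')

# Hand-written mapping that follows the example in the text
mapping = {'good': 0, 'average': 1, 'bad': 2}

# Replace every word with its numeric label
encoded = ratings.map(mapping)
print(encoded.tolist())
```

Note that scikit-learn’s LabelEncoder, which we use later in this tutorial, chooses the numbers itself by sorting the distinct values, so it would assign average = 0, bad = 1 and good = 2 instead.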

Now it’s time to understand it with a real-world example.

First, let’s download a dataset. The dataset that we will be using to explain label encoding is ‘50 Startups’. The link to download this dataset is given here: https://www.kaggle.com/farhanmd29/50-startups/download

Now let’s move to the coding part.

Step 1: Importing the dataset

Importing the dataset requires the pandas library. We use the ‘as’ keyword to import it under the short name pd, and then call the read_csv() function to load the dataset. See the code given here.

import pandas as pd
# Load the CSV file (it must be in the current working directory)
dataset = pd.read_csv('50_Startups.csv')
# Show the first five rows
dataset.head(5)

Output:

[Image: dataset table]

As you can see in the output, we have a ‘State’ column whose values are the names of different US states. A machine learning model cannot work with these text values directly. This is why we are going to label encode this column, as you will see in the next step.
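
Before encoding, it can help to confirm which columns are non-numeric. The sketch below builds a small hand-made DataFrame in place of the CSV so that it runs on its own. The column names and values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Small stand-in for the real CSV (invented values, assumed column names)
sample = pd.DataFrame({
    'R&D Spend': [1000.0, 2000.0, 1500.0],
    'State': ['New York', 'California', 'Florida'],
    'Profit': [500.0, 800.0, 650.0],
})

# Columns that are not numeric are the candidates for encoding
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)
```

With the real file, the same select_dtypes() call on dataset should point to the ‘State’ column.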

Step 2: Label Encoding

For label encoding, we need to import the LabelEncoder class as shown below. Then we create an object of this class and call its fit_transform() method to encode the ‘State’ column of the dataset.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

dataset['State'] = le.fit_transform(dataset['State'])

dataset.head(5)

In the output, the ‘State’ column now contains integers instead of state names, which shows that we have successfully label encoded our data.
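
To see which number stands for which value, we can inspect the fitted encoder. This is a minimal sketch on a few invented state names rather than the full dataset; classes_ and inverse_transform() are part of scikit-learn’s LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample values standing in for the 'State' column
states = ['New York', 'California', 'Florida', 'New York']

le = LabelEncoder()
codes = le.fit_transform(states)

# classes_ holds the distinct values in sorted order;
# the position of each value is its label
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps the labels back to the original values
print(list(le.inverse_transform(codes)))
```

Because the labels follow the sorted order of the values, California gets 0, Florida gets 1 and New York gets 2 in this sample.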

The drawback of using Label Encoding

As we have seen, label encoding assigns a number, starting from 0, to every distinct value. The problem with this method is that a machine learning model may treat a larger number as more significant, or assume an order between categories that does not exist (for example, that one state is ‘greater’ than another). This may lead to inaccuracies in our model. To solve this problem we can use one-hot encoding.
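
As a rough illustration of one-hot encoding, the sketch below uses pandas’ get_dummies() on a few invented state names. This is only one possible way to do it, and the sample data is an assumption, not the real dataset.

```python
import pandas as pd

# Invented sample values standing in for the 'State' column
sample = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

# Create one new 0/1 column per distinct state
one_hot = pd.get_dummies(sample, columns=['State'], dtype=int)

print(one_hot.columns.tolist())
print(one_hot.iloc[0].tolist())
```

Each row now has a 1 in the column of its own state and 0 everywhere else, so no state looks numerically larger than another.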

Thank you.
