Data cleaning with scikit-learn in Python

Introduction: Almost every data science problem starts with the same two obstacles: missing data and categorical data. In this article, we will study how to handle both, which tools and techniques scikit-learn offers, and how to use them in a hands-on coding part.

Simple imputer and label encoder: Data cleaning with scikit-learn in Python

Missing values: Almost every data set has this problem: some values are missing and show up as “None” or “NaN”. To handle this kind of situation we use scikit-learn’s imputers. There are several imputers available. The one used here is SimpleImputer, which we import from the sklearn.impute module (it replaces the older Imputer class from sklearn.preprocessing). First we specify which value marks a missing entry (missing_values), then the strategy used to fill it (strategy), and then we fit the imputer on the columns we want to clean. Let us see the coding part

import numpy as np
import pandas as pd

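from sklearn.impute import SimpleImputer

# x is a pandas DataFrame with the numeric columns "age" and "score"
# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x)                # learn the column means
print(imputer.transform(x))   # fill the missing values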

Output :

#Before applying imputer:

    age  score
0    12   56.0
1    34   89.0
2    10   46.0
3    28   56.0
4    39   60.0
5    16   70.0
6    45    NaN
7    32   78.0
8    43   67.0
9    22   78.0
10   63    NaN
11    3   10.0

#After applying imputer

[[12. 56.]
 [34. 89.]
 [10. 46.]
 [28. 56.]
 [39. 60.]
 [16. 70.]
 [45. 61.]
 [32. 78.]
 [43. 67.]
 [22. 78.]
 [63. 61.]
 [ 3. 10.]]
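
The steps above can also be written as one self-contained script. This is only a minimal sketch: the DataFrame is rebuilt by hand from the table shown above, and the variable names missing_before and filled are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumption: x is rebuilt by hand from the "before" table above.
x = pd.DataFrame({
    "age": [12, 34, 10, 28, 39, 16, 45, 32, 43, 22, 63, 3],
    "score": [56, 89, 46, 56, 60, 70, np.nan, 78, 67, 78, np.nan, 10],
})

# Count the missing values in each column before imputing.
missing_before = x.isna().sum()

# fit_transform learns each column's mean and fills the gaps in one call.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(x)

print(missing_before.to_dict())   # missing values per column
print(imputer.statistics_)        # fill value learned for each column
print(filled[6], filled[10])      # the two rows that had NaN in "score"
```

The ten known scores add up to 610, so the mean used to fill rows 6 and 10 is 61, which matches the output above. SimpleImputer also accepts "median", "most_frequent" and "constant" as strategy, which may suit skewed or non-numeric columns better.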

Categorical data: To handle categorical data, scikit-learn provides LabelEncoder, which converts each category into an integer code. The categories are sorted first, so here “no” becomes 0 and “yes” becomes 1. We can import it from sklearn.preprocessing. Let us see the coding part

from sklearn.preprocessing import LabelEncoder

# Map each category in the column to an integer code
encoder = LabelEncoder()
y = encoder.fit_transform(x["account / Not"])
print(y)

Output :

#Before applying encoder:

0     yes
1      no
2     yes
3      no
4      no
5      no
6      no
7     yes
8     yes
9     yes
10     no
11    yes
12     no
13     no

#After applying encoder:
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0])
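
To check which integer stands for which category, we can inspect the fitted encoder. This is a minimal sketch: the Series account is a hand-typed stand-in for the "account / Not" column shown above, and restored is a name made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: a hand-typed copy of the "account / Not" column shown above.
account = pd.Series(["yes", "no", "yes", "no", "no", "no", "no",
                     "yes", "yes", "yes", "no", "yes", "no", "no"])

encoder = LabelEncoder()
y = encoder.fit_transform(account)

# classes_ lists the categories in sorted order; each one's position is its code.
print(encoder.classes_)
print(y)

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(y)
print(list(restored) == list(account))
```

Keeping the fitted encoder around is useful because the same object can later turn predicted codes back into readable labels.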

