Data cleaning with scikit-learn in Python
Introduction: Almost every data science problem starts with the same two obstacles: the first is missing data and the second is categorical data. In this article, we will study how to handle these problems, covering the tools and techniques involved and the hands-on coding part.
Simple imputer and label encoder: Data cleaning with scikit-learn in Python
Missing values: We see this problem in almost every data set: some values are missing and show up as “None” or “NaN”. To handle this situation we use scikit-learn's imputers. Several imputers are available. The first one is SimpleImputer, which we import from the sklearn.impute module (older releases provided an Imputer class in sklearn.preprocessing, which has since been removed). First, we specify which value marks a missing entry (missing_values), then the strategy used to fill it (strategy), and then we fit the imputer on the columns we want to clean. Let us see the coding part:
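    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # x holds the data shown in the "before" table below
    x = pd.DataFrame({
        "age": [12, 34, 10, 28, 39, 16, 45, 32, 43, 22, 63, 3],
        "score": [56, 89, 46, 56, 60, 70, np.nan, 78, 67, 78, np.nan, 10],
    })

    # Replace every NaN with the mean of its column
    imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
    imputer.fit(x)
    print(imputer.transform(x))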
Output:
#Before applying imputer:
| | age | score |
|---|---|---|
| 0 | 12 | 56.0 |
| 1 | 34 | 89.0 |
| 2 | 10 | 46.0 |
| 3 | 28 | 56.0 |
| 4 | 39 | 60.0 |
| 5 | 16 | 70.0 |
| 6 | 45 | NaN |
| 7 | 32 | 78.0 |
| 8 | 43 | 67.0 |
| 9 | 22 | 78.0 |
| 10 | 63 | NaN |
| 11 | 3 | 10.0 |
#After applying imputer
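The imputed result can be reproduced with a minimal sketch, assuming the "before" table is held in a pandas DataFrame. The name df, the use of fit_transform, and wrapping the result back into a DataFrame are illustrative choices and are not taken from the original code. With the mean strategy, each missing score is replaced by the mean of the ten observed scores, which is 61.0 for this data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Data copied from the "before" table; two scores are missing.
df = pd.DataFrame({
    "age": [12, 34, 10, 28, 39, 16, 45, 32, 43, 22, 63, 3],
    "score": [56.0, 89.0, 46.0, 56.0, 60.0, 70.0, np.nan,
              78.0, 67.0, 78.0, np.nan, 10.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and fills the gaps in one step.
# It returns a NumPy array, so wrap it to keep the row and column labels.
filled = pd.DataFrame(imputer.fit_transform(df),
                      columns=df.columns, index=df.index)

# statistics_ holds the fill value learned for each column.
score_fill = imputer.statistics_[df.columns.get_loc("score")]
print("fill value for score:", score_fill)
print(filled)

# Other strategies only change which statistic is learned.
median_fill = (SimpleImputer(strategy="median")
               .fit(df[["score"]])
               .statistics_[0])
print("median fill value for score:", median_fill)
```

Rows 6 and 10 of the score column are the only ones that change; the age column has no missing values, so it passes through unchanged apart from being converted to floats. Whether the mean or the median is the better fill value depends on how skewed the column is.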