Handle Missing Values using Various Methods in Python
In this tutorial, we will learn how to handle missing values in data using the Python programming language. Detecting and handling missing values is a crucial part of data preprocessing, and it makes the data more reliable and interpretable for further processing.
Before we begin preprocessing, we need data to work with. We can download a dataset from the internet or create one ourselves. In this tutorial, we will create a simple data frame and intentionally introduce some missing values. Have a look at the code snippet below.
import pandas as pd
import numpy as np
import random

NoOfRows = 100000
NoOfProducts = 100
NoOfCustomers = 5000

product_Ids = ['P' + str(i).zfill(4) for i in range(1, NoOfProducts + 1)]
customer_Ids = ['C' + str(i).zfill(6) for i in range(1, NoOfCustomers + 1)]
dates_Data = pd.date_range(start='2022-01-01', end='2023-12-31', freq='D')

sales_data = []
for _ in range(NoOfRows):
    date = random.choice(dates_Data)
    product_id = random.choice(product_Ids)
    quantity_sold = random.randint(1, 10) if random.random() < 0.9 else np.nan
    price = round(random.uniform(10, 100), 2) if random.random() < 0.999 else np.nan
    customer_id = random.choice(customer_Ids) if random.random() < 0.99 else np.nan
    sales_data.append([date, product_id, quantity_sold, price, customer_id])

sales_df = pd.DataFrame(sales_data, columns=['Date', 'Product_ID', 'Quantity_Sold', 'Price', 'Customer_ID'])
sales_df.to_csv('sales_data.csv', index=False)
We will be generating sales data, with random values under each column/feature of the data frame. To purposely introduce missing values, we use conditional expressions such as if random.random() < 0.9 else np.nan, which insert a NaN whenever the value returned by random.random() is 0.9 or greater.
Let’s load the dataset we created and print the first five rows of the data (the head) using the code snippet below:
df = pd.read_csv('sales_data.csv')
print(df.head())
The output of the code that gets displayed is as follows:
        Date Product_ID  Quantity_Sold  Price Customer_ID
0 2022-12-08      P0030           10.0  59.67     C002488
1 2023-06-03      P0018            2.0  51.83     C004673
2 2023-04-16      P0037            7.0  72.91     C000139
3 2023-03-17      P0008            NaN  87.38     C004899
4 2023-06-22      P0086            1.0  75.51     C002807
In Python, missing values are often represented as NaN (Not a Number) or None. As we can see, some NaN values are already present in the first five rows. Now that we have the dataset ready, let’s move on to the process of handling the missing values (NaN) in it.
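Before working on the full dataset, it helps to see how pandas flags missing values in the first place. Here is a minimal sketch on a tiny made-up frame (the column names here are illustrative, not from the sales data), showing that both np.nan and None are treated as missing:

```python
import pandas as pd
import numpy as np

# A small frame with one np.nan and one None -- pandas treats both as missing.
demo = pd.DataFrame({"qty": [3, np.nan, 7], "cust": ["C1", None, "C3"]})

# isna() returns a boolean mask; every missing cell shows up as True.
print(demo.isna())

# notna() is the complement -- True wherever a real value is present.
print(demo["qty"].notna().tolist())  # [True, False, True]
```

isnull()/notnull() are aliases of isna()/notna(), so the tutorial's df.isnull() calls below work the same way.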
Step 1 – Identifying the Number of Missing Values in Data
Before learning how to handle missing values, we need to count how many missing values are present in each column, using the code snippet below.
print(df.isnull().sum())
This will result in printing the number of missing values against each column and the output in this case is:
Date                0
Product_ID          0
Quantity_Sold    9967
Price             111
Customer_ID      1014
dtype: int64
In this case, we have the following observations:
- “Date: 0” indicates that there are no missing values in the “Date” column.
- “Product_ID: 0” indicates that there are no missing values in the “Product_ID” column.
- “Quantity_Sold: 9967” indicates that there are 9967 missing values in the “Quantity_Sold” column.
- “Price: 111” indicates that there are 111 missing values in the “Price” column.
- “Customer_ID: 1014” indicates that there are 1014 missing values in the “Customer_ID” column.
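Raw counts are easier to judge as percentages of the total rows. Here is a quick sketch of that computation; the tiny frame below is made up for illustration and simply reuses the tutorial's column names:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the sales data: 4 rows, some values missing.
df = pd.DataFrame({
    "Quantity_Sold": [1, np.nan, 3, np.nan],
    "Price": [10.0, 20.0, np.nan, 40.0],
})

# The mean of a boolean mask is the fraction of True values,
# so isnull().mean() * 100 gives the percentage missing per column.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

On the tutorial's generated data this would show roughly 10% missing in Quantity_Sold, matching the 0.9 probability used when creating it.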
Step 2 – Handle Missing Values in the Dataset
Now this section will cover various methods of removal/handling of the missing values present in the dataset. In this tutorial, we will discuss the following approaches:
- Dropping/Removing Missing Values
- Imputation
  - Mean Imputation
  - Median Imputation
  - Mode Imputation
  - Constant Imputation
- Forward and Backward Fill
- Interpolation
  - Linear Interpolation
  - Polynomial Interpolation
  - Spline Interpolation
Method 1 – Dropping/ Removing Missing Values
One of the most basic methods for handling missing values is simply removing the rows that contain them. This method should be used only when the amount of missing data is relatively small. We implement it using the code snippet below.
print(f"Number of rows before removal: {df.shape[0]}")
df_rowRemoval = df.dropna()
print(f"Number of rows after removal: {df_rowRemoval.shape[0]}")
The resulting output of the code is as follows:
Number of rows before removal: 100000
Number of rows after removal: 89024
Advantages of this Method: This method is simple and straightforward, and it leaves the dataset with no missing values while keeping the remaining data fully intact.
Disadvantages of this Method: This method can discard a significant amount of data (data loss), especially when many rows contain at least one missing value. In this case, around 11,000 rows (roughly 11% of the dataset) are removed, which is a substantial loss.
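dropna() also accepts parameters that limit this data loss. As a sketch (the small frame below is made up for illustration), subset drops a row only when a specific column is missing, and thresh keeps rows that still have a minimum number of real values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Quantity_Sold": [1, np.nan, 3, np.nan],
    "Price": [10.0, 20.0, np.nan, np.nan],
    "Customer_ID": ["C1", "C2", "C3", np.nan],
})

# Drop a row only if the critical 'Price' column is missing.
by_subset = df.dropna(subset=["Price"])

# Keep rows that still have at least 2 non-missing values.
by_thresh = df.dropna(thresh=2)

print(len(by_subset), len(by_thresh))  # 2 3
```

This way, rows that are only missing a non-essential field can survive the cleanup.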
Method 2 – Imputation
Imputation means replacing a missing value with a substitute value. We use this method when we want to retain every row of the dataset and simply fill in the missing values rather than dropping data.
Advantage of this method: This method retains the size and integrity of the dataset, and can be more efficient than removing missing data.
Disadvantages of this method: Introducing substitute values can add an error/fluctuation factor to the dataset, resulting in incorrect analysis. This method can also distort the distribution of the data.
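The distortion is easy to demonstrate. In this sketch (the series is made up for illustration), mean imputation leaves the mean unchanged but shrinks the standard deviation, because every imputed point lands exactly at the centre of the distribution:

```python
import pandas as pd
import numpy as np

s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

# Fill both gaps with the mean of the observed values (6.0).
filled = s.fillna(s.mean())

# Mean is preserved, but the spread of the data shrinks.
print(round(s.std(), 3), round(filled.std(), 3))
```

This shrinking variance is one reason mean imputation can bias downstream statistics.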
There are various ways to implement imputation:
- Mean Imputation – Calculate the mean of the values in the column and replace missing values with this mean.
- Median Imputation – Calculate the median of the values in the column and replace missing values with this median.
- Mode Imputation – Calculate the mode of the values in the column and replace missing values with this mode.
- Constant Imputation – Simply replace missing values with the chosen constant value.
imputedDF = df.copy()
# Assigning the result back avoids chained fillna(..., inplace=True),
# which is deprecated in recent versions of pandas.
imputedDF['Quantity_Sold'] = imputedDF['Quantity_Sold'].fillna(df['Quantity_Sold'].mean())
imputedDF['Price'] = imputedDF['Price'].fillna(df['Price'].median())
imputedDF['Customer_ID'] = imputedDF['Customer_ID'].fillna(value=0)
print(imputedDF.isnull().sum())
We fill the missing values in the column Quantity_Sold with the column mean, Price with the median, and Customer_ID with a constant value. We can also replace missing values with the mode using the snippet df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0]), replacing column_name with whichever column needs the missing-value handling.
Method 3 – Forward and Backward Fill
In this method, we make use of values already present in the dataset instead of computing or introducing new ones. For forward fill, we replace each missing value with the last observed value (the value before it). Similarly, for backward fill, we replace each missing value with the next observed value (the value after it).
The code snippet below will implement forward fill for the dataset.
print(df.tail(n=10))
forwardFillDF = df.copy()
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas.
forwardFillDF.ffill(inplace=True)
print("\n", forwardFillDF.tail(n=10))
Comparing the two printouts of df.tail() before and after the forward fill, each missing value in the new data frame has been filled with the last observed value from the row above it.
The code snippet below will implement backward-fill for the dataset.
print(df.tail(n=10))
backwardFillDF = df.copy()
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas.
backwardFillDF.bfill(inplace=True)
print("\n", backwardFillDF.tail(n=10))
Comparing the two printouts of df.tail() before and after the backward fill, each missing value in the new data frame has been filled with the next observed value from the row below it.
Method 4 – Interpolation
Another approach is, instead of taking ALL the data points into consideration, to use only the neighboring points around each missing value. This approach is known as interpolation. We will look at three different interpolation techniques: linear, polynomial, and spline interpolation. This method is useful when the data points in a column/feature follow a trend or pattern.
Advantages of Method: This method helps in preserving the overall trend being followed in the data.
Disadvantages of Method: If no regular trend/pattern is being followed in the data then the interpolation technique can end up making things even worse and making the handling much more complex.
Linear Interpolation
Linear interpolation estimates missing values by assuming a linear relationship between neighboring data points. Have a look at the code snippet below to understand the implementation of interpolation:
print(df.tail(n=10))
linearInterpolateDF = df.copy()
linearInterpolateDF.interpolate(method='linear', inplace=True)
print("\n", linearInterpolateDF.tail(n=10))
The output of the code is as follows. Notice that row 99994, for example, gets (2.0 + 9.0) / 2 = 5.5, the midpoint of its two neighbors:
            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019            8.0  59.51     C001916
99991 2022-08-11      P0039            6.0  51.54     C002941
99992 2023-08-19      P0071            8.0  37.88     C004946
99993 2023-05-01      P0049            2.0  36.03     C001286
99994 2022-01-22      P0022            NaN  98.51     C002680
99995 2023-12-09      P0050            9.0  84.26     C000869
99996 2023-12-16      P0008            2.0  36.01         NaN
99997 2023-06-12      P0006            3.0  85.60     C002846
99998 2023-04-19      P0066            NaN  55.79     C001906
99999 2022-10-18      P0030            8.0  41.87     C002045

            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019            8.0  59.51     C001916
99991 2022-08-11      P0039            6.0  51.54     C002941
99992 2023-08-19      P0071            8.0  37.88     C004946
99993 2023-05-01      P0049            2.0  36.03     C001286
99994 2022-01-22      P0022            5.5  98.51     C002680
99995 2023-12-09      P0050            9.0  84.26     C000869
99996 2023-12-16      P0008            2.0  36.01         NaN
99997 2023-06-12      P0006            3.0  85.60     C002846
99998 2023-04-19      P0066            5.5  55.79     C001906
99999 2022-10-18      P0030            8.0  41.87     C002045
Polynomial Interpolation
Polynomial interpolation fits a polynomial function to the data points and estimates missing values from that polynomial. Since a polynomial mathematically has a degree/order, we need to specify it in the function as well. Have a look at the code snippet below to understand the implementation:
print(df.tail(n=10))
polynomialInterpolarDF = df.copy()
polynomialInterpolarDF.interpolate(method='polynomial', order=2, inplace=True)
print("\n", polynomialInterpolarDF.tail(n=10))
The output of the code is as follows:
            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019            8.0  59.51     C001916
99991 2022-08-11      P0039            6.0  51.54     C002941
99992 2023-08-19      P0071            8.0  37.88     C004946
99993 2023-05-01      P0049            2.0  36.03     C001286
99994 2022-01-22      P0022            NaN  98.51     C002680
99995 2023-12-09      P0050            9.0  84.26     C000869
99996 2023-12-16      P0008            2.0  36.01         NaN
99997 2023-06-12      P0006            3.0  85.60     C002846
99998 2023-04-19      P0066            NaN  55.79     C001906
99999 2022-10-18      P0030            8.0  41.87     C002045

            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019       8.000000  59.51     C001916
99991 2022-08-11      P0039       6.000000  51.54     C002941
99992 2023-08-19      P0071       8.000000  37.88     C004946
99993 2023-05-01      P0049       2.000000  36.03     C001286
99994 2022-01-22      P0022       5.649908  98.51     C002680
99995 2023-12-09      P0050       9.000000  84.26     C000869
99996 2023-12-16      P0008       2.000000  36.01         NaN
99997 2023-06-12      P0006       3.000000  85.60     C002846
99998 2023-04-19      P0066       5.543695  55.79     C001906
99999 2022-10-18      P0030       8.000000  41.87     C002045
Spline Interpolation
Spline interpolation fits piecewise polynomial functions, known as splines, to the data points. As this method also deals with polynomials, we need to specify the order in the function as well. Have a look at the code snippet below to understand the implementation:
print(df.tail(n=10))
splineInterpolarDF = df.copy()
splineInterpolarDF.interpolate(method='spline', order=2, inplace=True)
print("\n", splineInterpolarDF.tail(n=10))
When executed, this interpolation technique takes a long time, which makes it expensive to run on a dataset of this size. After several minutes, we get the following output:
            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019            8.0  59.51     C001916
99991 2022-08-11      P0039            6.0  51.54     C002941
99992 2023-08-19      P0071            8.0  37.88     C004946
99993 2023-05-01      P0049            2.0  36.03     C001286
99994 2022-01-22      P0022            NaN  98.51     C002680
99995 2023-12-09      P0050            9.0  84.26     C000869
99996 2023-12-16      P0008            2.0  36.01         NaN
99997 2023-06-12      P0006            3.0  85.60     C002846
99998 2023-04-19      P0066            NaN  55.79     C001906
99999 2022-10-18      P0030            8.0  41.87     C002045

            Date Product_ID  Quantity_Sold  Price Customer_ID
99990 2023-11-26      P0019       8.000000  59.51     C001916
99991 2022-08-11      P0039       6.000000  51.54     C002941
99992 2023-08-19      P0071       8.000000  37.88     C004946
99993 2023-05-01      P0049       2.000000  36.03     C001286
99994 2022-01-22      P0022      -0.240287  98.51     C002680
99995 2023-12-09      P0050       9.000000  84.26     C000869
99996 2023-12-16      P0008       2.000000  36.01         NaN
99997 2023-06-12      P0006       3.000000  85.60     C002846
99998 2023-04-19      P0066      17.363485  55.79     C001906
99999 2022-10-18      P0030       8.000000  41.87     C002045
Please note: as you may have noticed, the interpolation methods do not work on non-numeric columns (in this case, Customer_ID), which makes them far less useful for columns where no numeric trend can be found. Also note that the spline fit has produced values outside the original 1–10 range of Quantity_Sold (for example, -0.240287 and 17.363485), a reminder that higher-order fits can overshoot badly when the data follows no real trend.
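To sidestep the non-numeric columns entirely, we can interpolate only the numeric part of the frame and leave text columns untouched. A minimal sketch (the small frame below is made up for illustration, reusing the tutorial's column names):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Quantity_Sold": [2.0, np.nan, 8.0],
    "Customer_ID": ["C1", np.nan, "C3"],
})

# Select only numeric columns and interpolate those in place.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].interpolate(method="linear")

print(df["Quantity_Sold"].tolist())  # [2.0, 5.0, 8.0]
```

The missing Customer_ID stays NaN and can then be handled with a method that suits categorical data, such as mode or constant imputation from Method 2.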
I hope now you can remove and handle missing values from a dataset like a pro! Thank you for reading.
Have a read at the following tutorials as well:
- Data preprocessing (splitting dataset before training model)
- Data cleaning with scikit-learn in Python
- sklearn.preprocessing.normalize in Python
Happy Learning!