Handle Missing Values using Various Methods in Python

In this tutorial, we will learn how to handle missing values in the data we get for analysis using the Python programming language. Detecting and handling missing values is a crucial part of data preprocessing, and it makes the data more reliable and interpretable for further processing.

Before we begin preprocessing, we need data to work with. We can import/download data from the internet, or we can create the data ourselves. In this tutorial, we will create a simple data frame and intentionally introduce some missing values. Have a look at the code snippet below.

import pandas as pd
import numpy as np
import random

NoOfRows = 100000
NoOfProducts = 100
NoOfCustomers = 5000
product_Ids = ['P' + str(i).zfill(4) for i in range(1, NoOfProducts + 1)]
customer_Ids = ['C' + str(i).zfill(6) for i in range(1, NoOfCustomers + 1)]
dates_Data = pd.date_range(start='2022-01-01', end='2023-12-31', freq='D')

sales_data = []
for _ in range(NoOfRows):
    date = random.choice(dates_Data)
    product_id = random.choice(product_Ids)
    quantity_sold = random.randint(1, 10) if random.random() < 0.9 else np.nan 
    price = round(random.uniform(10, 100), 2) if random.random() < 0.999 else np.nan 
    customer_id = random.choice(customer_Ids) if random.random() < 0.99 else np.nan 
    sales_data.append([date, product_id, quantity_sold, price, customer_id])

sales_df = pd.DataFrame(sales_data, columns=['Date', 'Product_ID', 'Quantity_Sold', 'Price', 'Customer_ID'])
sales_df.to_csv('sales_data.csv', index=False)

We will be generating sales data with random values under each column/feature of the data frame. To purposely introduce missing values, we use expressions of the form: value if random.random() < 0.9 else np.nan, which insert NaN whenever random.random() returns 0.9 or higher, i.e. in roughly 10% of the rows (the thresholds 0.999 and 0.99 do the same for Price and Customer_ID at roughly 0.1% and 1%).
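The injection logic can be sanity-checked in isolation. The sketch below (with a fixed seed so the run is reproducible) confirms that the conditional expression produces NaN in roughly 10% of draws:

```python
import random
import numpy as np

random.seed(42)  # fixed seed so the sketch is reproducible

# Same pattern as Quantity_Sold: keep a value with probability 0.9, else NaN
values = [random.randint(1, 10) if random.random() < 0.9 else np.nan
          for _ in range(10000)]

nan_count = sum(1 for v in values if isinstance(v, float) and np.isnan(v))
print(f"NaN fraction: {nan_count / len(values):.3f}")  # close to 0.100
```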

Let’s load the dataset created into the code and print the first five rows of the data (head) using the code snippet below:

df = pd.read_csv('sales_data.csv')
print(df.head())

The output of the code that gets displayed is as follows:

         Date Product_ID  Quantity_Sold  Price Customer_ID
0  2022-12-08      P0030           10.0  59.67     C002488
1  2023-06-03      P0018            2.0  51.83     C004673
2  2023-04-16      P0037            7.0  72.91     C000139
3  2023-03-17      P0008            NaN  87.38     C004899
4  2023-06-22      P0086            1.0  75.51     C002807

In Python, missing values are often represented as NaN (Not a Number) or None. As we can see, there are already some NaN values present in the first five rows. Now that we have the dataset ready, let's move on to the process of handling the missing values (NaN) in the dataset.

Step 1 – Identifying the Number of Missing Values in Data

Before learning how to handle missing values, we need to determine the number of missing values present in each column, using the code snippet below.

print(df.isnull().sum())

This will result in printing the number of missing values against each column and the output in this case is:

Date                0
Product_ID          0
Quantity_Sold    9967
Price             111
Customer_ID      1014
dtype: int64

In this case, we have the following observations:

  • “Date: 0” indicates that there are no missing values in the “Date” column.
  • “Product_ID: 0” indicates that there are no missing values in the “Product_ID” column.
  • “Quantity_Sold: 9967” indicates that there are 9967 missing values in the “Quantity_Sold” column.
  • “Price: 111” indicates that there are 111 missing values in the “Price” column.
  • “Customer_ID: 1014” indicates that there are 1014 missing values in the “Customer_ID” column.
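Raw counts are easier to judge as a percentage of the total row count. A minimal sketch of that check, run on a small hypothetical frame rather than the generated CSV:

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame standing in for the generated sales data
mini = pd.DataFrame({
    'Quantity_Sold': [1.0, np.nan, 3.0, np.nan],
    'Price': [10.0, 20.0, np.nan, 40.0],
    'Customer_ID': ['C1', 'C2', 'C3', 'C4'],
})

# The mean of the boolean isnull() mask is the fraction of missing entries
missing_pct = mini.isnull().mean() * 100
print(missing_pct)
```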

Step 2 – Handle Missing Values in the Dataset

This section covers various methods of handling/removing the missing values present in the dataset. In this tutorial, we will discuss the following approaches:

  1. Dropping/ Removing Missing Values
  2. Imputation
    • Mean Imputation
    • Median Imputation
    • Mode Imputation
    • Constant Imputation
  3. Forward and Backward Fill
  4. Interpolation
    • Linear Interpolation
    • Polynomial Interpolation
    • Spline Interpolation

Method 1 – Dropping/ Removing Missing Values

One of the most basic methods for handling missing values is simply removing the rows that contain them. This method is appropriate when the amount of missing data is relatively small. We implement it using the code snippet below.

print(f"Number of rows before removal: {df.shape[0]}")
df_rowRemoval = df.dropna()
print(f"Number of rows after removal: {df_rowRemoval.shape[0]}")

The resulting output of the code is as follows:

Number of rows before removal: 100000
Number of rows after removal: 89024

Advantages of this Method: This method is simple and straightforward, and the remaining data is guaranteed to contain no missing values.

Disadvantages of this Method: This method can end up discarding a significant amount of data (data loss), especially if many rows contain missing values. In this case, around 11,000 rows (100000 - 89024 = 10976) are removed, which is a substantial data loss.
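When a wholesale dropna() is too aggressive, its subset and thresh parameters limit what gets removed. A small sketch on a hypothetical three-row frame:

```python
import pandas as pd
import numpy as np

mini = pd.DataFrame({
    'Quantity_Sold': [1.0, np.nan, 3.0],
    'Price': [10.0, 20.0, np.nan],
    'Customer_ID': ['C1', None, 'C3'],
})

# Drop a row only when Price is missing; NaNs elsewhere are tolerated
by_subset = mini.dropna(subset=['Price'])

# Keep rows that have at least 2 non-missing values
by_thresh = mini.dropna(thresh=2)

print(list(by_subset.index))  # rows 0 and 1 survive
print(list(by_thresh.index))  # rows 0 and 2 survive
```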

Method 2 – Imputation

Imputation means replacing a missing value with a substitute value. We use this method when we want to retain the size and integrity of the dataset: no rows are lost, and only the missing values are filled in.

Advantage of this method: This method retains the size and integrity of the dataset, and can be more efficient than removing missing data.

Disadvantages of this method: Substituted values can introduce error/bias into the dataset, resulting in incorrect analysis. This method can also distort the distribution of the data.

There are various methods of implementing imputation:

  • Mean Imputation – Calculate the mean of the values in the column and replace missing values with this mean.
  • Median Imputation – Calculate the median of the values in the column and replace missing values with this median.
  • Mode Imputation – Calculate the mode of the values in the column and replace missing values with this mode.
  • Constant Imputation – Simply replace missing values with the chosen constant value.

imputedDF = df.copy()
imputedDF['Quantity_Sold'] = imputedDF['Quantity_Sold'].fillna(df['Quantity_Sold'].mean())
imputedDF['Price'] = imputedDF['Price'].fillna(df['Price'].median())
imputedDF['Customer_ID'] = imputedDF['Customer_ID'].fillna(value=0)

print(imputedDF.isnull().sum())

We fill the missing values in the column Quantity_Sold with the mean of the column, Price with the median, and Customer_ID with a constant value. We can also fill with the mode using the snippet: df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0]), replacing column_name with whichever column needs the operation.
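Mode imputation also works for non-numeric columns, which mean and median cannot handle. A small sketch (note that mode() can return several values on a tie, hence the [0]):

```python
import pandas as pd
import numpy as np

ids = pd.Series(['P01', 'P02', 'P01', np.nan, 'P01'])

# Fill with the most frequent value; [0] picks the first mode on a tie
filled = ids.fillna(ids.mode()[0])
print(filled.tolist())  # ['P01', 'P02', 'P01', 'P01', 'P01']
```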

Method 3 – Forward and Backward Fill

In this method, we reuse values already present in the dataset instead of computing or introducing new ones. For forward fill, we replace each missing value with the last observed value (the value before it). Similarly, for backward fill, we replace each missing value with the next observed value (the value after it).

The code snippet below will implement forward fill for the dataset.

print(df.tail(n=10))

forwardFillDF = df.copy()
forwardFillDF.ffill(inplace=True)

print("\n",forwardFillDF.tail(n=10))

The output of the code is as follows. In the output below, I have highlighted certain values to show the change in the dataset before and after forward filling. As we can see, each missing value is filled with the value that precedes it.

Forward Fill Output

The code snippet below will implement backward-fill for the dataset.

print(df.tail(n=10))

backwardFillDF = df.copy()
backwardFillDF.bfill(inplace=True)

print("\n",backwardFillDF.tail(n=10))

The output of the code is as follows. In the output below, I have highlighted certain values to show the change in the dataset before and after backward filling. As we can see, each missing value is filled with the value that follows it.

Backward Fill Output
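The difference between the two fills is easiest to see on a tiny series with a two-value gap:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0] - last seen value carried forward
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0] - next seen value carried backward
```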

Method 4 – Interpolation

Another approach is to consider only the neighboring points around a missing value, instead of ALL the data points. This approach is known as interpolation. We will look at three different interpolation techniques: linear interpolation, polynomial interpolation, and spline interpolation. This method is useful when the data points in a column/feature follow a trend/pattern.

Advantages of this Method: This method helps preserve the overall trend present in the data.

Disadvantages of this Method: If the data follows no regular trend/pattern, interpolation can end up making things worse and make the handling much more complex.

Linear Interpolation

Linear interpolation estimates missing values by assuming a linear relationship between neighboring data points. Have a look at the code snippet below to understand the implementation of interpolation:

print(df.tail(n=10))

linearInterpolateDF = df.copy()
linearInterpolateDF.interpolate(method='linear', inplace=True)

print("\n",linearInterpolateDF.tail(n=10))

The output of the code is as follows:

             Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019            8.0  59.51     C001916
99991  2022-08-11      P0039            6.0  51.54     C002941
99992  2023-08-19      P0071            8.0  37.88     C004946
99993  2023-05-01      P0049            2.0  36.03     C001286
99994  2022-01-22      P0022            NaN  98.51     C002680
99995  2023-12-09      P0050            9.0  84.26     C000869
99996  2023-12-16      P0008            2.0  36.01         NaN
99997  2023-06-12      P0006            3.0  85.60     C002846
99998  2023-04-19      P0066            NaN  55.79     C001906
99999  2022-10-18      P0030            8.0  41.87     C002045

              Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019            8.0  59.51     C001916
99991  2022-08-11      P0039            6.0  51.54     C002941
99992  2023-08-19      P0071            8.0  37.88     C004946
99993  2023-05-01      P0049            2.0  36.03     C001286
99994  2022-01-22      P0022            5.5  98.51     C002680
99995  2023-12-09      P0050            9.0  84.26     C000869
99996  2023-12-16      P0008            2.0  36.01         NaN
99997  2023-06-12      P0006            3.0  85.60     C002846
99998  2023-04-19      P0066            5.5  55.79     C001906
99999  2022-10-18      P0030            8.0  41.87     C002045
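The filled values are easy to verify by hand: row 99994 sits between a 2.0 and a 9.0, and linear interpolation takes the midpoint, (2.0 + 9.0) / 2 = 5.5 (the same happens at row 99998 between 3.0 and 8.0). The same computation on an isolated series:

```python
import pandas as pd
import numpy as np

# The same neighbourhood as row 99994 above: 2.0, NaN, 9.0
s = pd.Series([2.0, np.nan, 9.0])
print(s.interpolate(method='linear').tolist())  # [2.0, 5.5, 9.0]
```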

Polynomial Interpolation

Polynomial interpolation fits a polynomial function to the data points and estimates missing values from that polynomial. Since a polynomial mathematically has a degree/order, we need to specify it in the function call as well. Have a look at the code snippet below to understand the implementation of interpolation:

print(df.tail(n=10))

polynomialInterpolateDF = df.copy()
polynomialInterpolateDF.interpolate(method='polynomial', order=2, inplace=True)

print("\n",polynomialInterpolateDF.tail(n=10))

The output of the code is as follows:

             Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019            8.0  59.51     C001916
99991  2022-08-11      P0039            6.0  51.54     C002941
99992  2023-08-19      P0071            8.0  37.88     C004946
99993  2023-05-01      P0049            2.0  36.03     C001286
99994  2022-01-22      P0022            NaN  98.51     C002680
99995  2023-12-09      P0050            9.0  84.26     C000869
99996  2023-12-16      P0008            2.0  36.01         NaN
99997  2023-06-12      P0006            3.0  85.60     C002846
99998  2023-04-19      P0066            NaN  55.79     C001906
99999  2022-10-18      P0030            8.0  41.87     C002045

              Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019       8.000000  59.51     C001916
99991  2022-08-11      P0039       6.000000  51.54     C002941
99992  2023-08-19      P0071       8.000000  37.88     C004946
99993  2023-05-01      P0049       2.000000  36.03     C001286
99994  2022-01-22      P0022       5.649908  98.51     C002680
99995  2023-12-09      P0050       9.000000  84.26     C000869
99996  2023-12-16      P0008       2.000000  36.01         NaN
99997  2023-06-12      P0006       3.000000  85.60     C002846
99998  2023-04-19      P0066       5.543695  55.79     C001906
99999  2022-10-18      P0030       8.000000  41.87     C002045
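To see the order parameter at work in isolation, the sketch below interpolates a series whose known points at positions 0, 2, and 4 lie exactly on the quadratic (x + 1)^2; a degree-2 fit recovers the missing squares exactly (note that this method requires SciPy to be installed):

```python
import pandas as pd
import numpy as np

# Known points lie on (x + 1)**2; the gaps at x = 1 and x = 3 should become 4 and 16
s = pd.Series([1.0, np.nan, 9.0, np.nan, 25.0])
result = s.interpolate(method='polynomial', order=2)
print(result.tolist())  # [1.0, 4.0, 9.0, 16.0, 25.0], up to floating-point error
```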

Spline Interpolation

Spline interpolation fits piecewise polynomial functions, known as splines, to the data points. Since this method also deals with polynomials, we need to specify the order in the function call as well. Have a look at the code snippet below to understand the implementation of interpolation:

print(df.tail(n=10))

splineInterpolateDF = df.copy()
splineInterpolateDF.interpolate(method='spline', order=2, inplace=True)

print("\n",splineInterpolateDF.tail(n=10))

This interpolation technique takes a long time to execute, which makes the method expensive. After several minutes, we get the following output:

             Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019            8.0  59.51     C001916
99991  2022-08-11      P0039            6.0  51.54     C002941
99992  2023-08-19      P0071            8.0  37.88     C004946
99993  2023-05-01      P0049            2.0  36.03     C001286
99994  2022-01-22      P0022            NaN  98.51     C002680
99995  2023-12-09      P0050            9.0  84.26     C000869
99996  2023-12-16      P0008            2.0  36.01         NaN
99997  2023-06-12      P0006            3.0  85.60     C002846
99998  2023-04-19      P0066            NaN  55.79     C001906
99999  2022-10-18      P0030            8.0  41.87     C002045

              Date Product_ID  Quantity_Sold  Price Customer_ID
99990  2023-11-26      P0019       8.000000  59.51     C001916
99991  2022-08-11      P0039       6.000000  51.54     C002941
99992  2023-08-19      P0071       8.000000  37.88     C004946
99993  2023-05-01      P0049       2.000000  36.03     C001286
99994  2022-01-22      P0022      -0.240287  98.51     C002680
99995  2023-12-09      P0050       9.000000  84.26     C000869
99996  2023-12-16      P0008       2.000000  36.01         NaN
99997  2023-06-12      P0006       3.000000  85.60     C002846
99998  2023-04-19      P0066      17.363485  55.79     C001906
99999  2022-10-18      P0030       8.000000  41.87     C002045

Please note, as you may have noticed in the outputs above, that interpolation does not work for non-numeric columns (in this case, Customer_ID), which makes it much less useful for columns where no numeric trend exists.
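One practical workaround, sketched below, is to interpolate only the numeric columns selected via select_dtypes, and handle identifier columns with one of the earlier methods (for example, mode or constant imputation):

```python
import pandas as pd
import numpy as np

mini = pd.DataFrame({
    'Quantity_Sold': [2.0, np.nan, 9.0],
    'Customer_ID': ['C1', np.nan, 'C3'],
})

# Interpolate only the numeric columns; object columns are left untouched
num_cols = mini.select_dtypes(include='number').columns
mini[num_cols] = mini[num_cols].interpolate(method='linear')

print(mini)
```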

I hope now you can remove and handle missing values from a dataset like a pro! Thank you for reading.

Have a read at the following tutorials as well:

  1. Data preprocessing (splitting dataset before training model)
  2. Data cleaning with scikit-learn in Python
  3. sklearn.preprocessing.normalize in Python

Happy Learning!
