Detect and Handle Outliers using Various Methods in Python

In this tutorial, we will learn how to detect and handle outliers in Python using several methods. Handling outliers is an important step in keeping an analysis robust and reliable. Left untreated, outliers can skew results in one direction and lead to faulty conclusions.

Step 1 – Data Creation

Before we can detect and handle outliers, we need some data that contains them. We will use the NumPy library to generate random data points and deliberately add outliers in three steps:

  1. Create 1,000 data points drawn from a normal distribution with a mean of 0 and a standard deviation of 1.
  2. Pick 50 random, non-repeating index positions where the outlier values will go.
  3. Replace the values at those positions with draws from a normal distribution with a mean of 10 and a standard deviation of 5, most of which fall outside the range of the original data.

import numpy as np

# 1,000 "normal" points centred on 0
original_DATA = np.random.normal(loc=0, scale=1, size=1000)
# 50 distinct positions that will hold the outliers
outliers__INDEXING = np.random.choice(1000, size=50, replace=False)
# Overwrite those positions with draws centred on 10
original_DATA[outliers__INDEXING] = np.random.normal(loc=10, scale=5, size=50)
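
Because the data is random, your values will differ from the ones printed later in this tutorial. If you want the same data on every run, one option is to draw from a seeded generator. The sketch below assumes NumPy's `np.random.default_rng`; the seed value 42 and the helper name `make_data_with_outliers` are illustrative choices, not part of the original code.

```python
import numpy as np

def make_data_with_outliers(seed=42, n_points=1000, n_outliers=50):
    """Hypothetical helper: same recipe as above, but reproducible."""
    rng = np.random.default_rng(seed)  # seeded generator (assumed seed)
    data = rng.normal(loc=0, scale=1, size=n_points)
    # Distinct positions that will hold the outliers
    idx = rng.choice(n_points, size=n_outliers, replace=False)
    data[idx] = rng.normal(loc=10, scale=5, size=n_outliers)
    return data, idx

data, outlier_idx = make_data_with_outliers()
print(data.shape, len(set(outlier_idx)))
```

Returning the outlier positions alongside the data also lets you check later how many of the planted points each method actually catches.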

Next, let’s plot the data with the Matplotlib library. A simple scatter plot makes the outliers we just added easy to spot. Have a look at the code snippet below.

import matplotlib.pyplot as plt

plt.scatter(range(len(original_DATA)), original_DATA)
plt.title('Visualization of Data Points with Outliers')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()

The resulting scatter plot we get is shown below:

Scatter Plot - Outliers

You can see that many points sit well away from the main cluster of data. These points are the outliers, and we aim to detect and remove them in the upcoming sections.

Step 2 – Detect Outliers in the Dataset

There are multiple ways of detecting outliers in a dataset. Some of them are as follows:

  1. Visual Inspection (Scatter Plot/Box Plot)
  2. Z-Score Method
  3. Interquartile Range (IQR) Method

Visual Inspection (Scatter Plot/Box Plot)

Earlier, the scatter plot already helped us see the outlier points in the dataset. Another plot we can use is the box plot. If you are unfamiliar with how box plots work, have a look at the tutorial below.

Also Read: Understanding Python pandas.DataFrame.boxplot

Have a look at the code snippet below, which draws the box plot for our data so we can see what it tells us about the outliers.

plt.boxplot(original_DATA)
plt.title('Box Plot of original_DATA')
plt.ylabel('Values')
plt.show()

The resulting box plot is shown below. The data points that fall beyond the whiskers of the box plot are the outliers for this data.

Box Plot - Outliers

Z-Score Method

For this method, we pick a threshold value (here 3, though you can choose another). Points whose absolute z-score exceeds the threshold are treated as outliers. To compute the z-scores, we will use the zscore function from the scipy.stats module, as shown in the code below.

from scipy.stats import zscore

ZScores = zscore(original_DATA)
outliersZScores = original_DATA[np.abs(ZScores) > 3]
print("Outliers identified using Z-score method:", outliersZScores)

The result of the code is:

Outliers identified using Z-score method: [ 9.86350139 15.35404599  8.29283482 19.09536569  8.25737655  8.29948724
  9.57236708  9.30812776 15.40890745 14.17014422 12.64205099 10.67552906
 10.36795475 15.37176292  9.87939626  9.73487269 13.67495177 13.45637576
  9.1231785  14.96648791 14.17524845 13.58461933 10.70699131 11.31068463
 16.51886462 14.35204512  9.59282808 12.76576669 16.24021761 14.36302842
 14.82071389  8.84944193 12.65669227]
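
To see what zscore computes, here is a minimal sketch on a tiny made-up array. It assumes the default behaviour of scipy.stats.zscore (population standard deviation, ddof=0), which matches `np.std`; the array `sample` and the helper `zscore_outliers` are hypothetical names used only for this illustration.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    # z-score: distance from the mean, measured in standard deviations
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Twenty ordinary values (0..19) plus one extreme value
sample = np.append(np.arange(20), 200)
print(zscore_outliers(sample))  # [200.]
```

Only the value 200 lies more than 3 standard deviations from the mean, so it is the only point flagged.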

Interquartile Range (IQR) Method

This is a more statistical approach. If you are unfamiliar with it, have a look at the tutorial mentioned below. For now, we implement the method directly in the code snippet below.

Also Read: Outlier Detection from Inter-Quartile Range in Machine Learning | Python

q1 = np.percentile(original_DATA, 25)
q3 = np.percentile(original_DATA, 75)
IQR = q3 - q1
lower_bound = q1 - 1.5 * IQR
upper_bound = q3 + 1.5 * IQR
outliersIQR = original_DATA[(original_DATA < lower_bound) | (original_DATA > upper_bound)]
print("Outliers identified using IQR method:", outliersIQR)

The result of the code is:

Outliers identified using IQR method: [ 4.91446421  9.86350139 15.35404599  4.56582595  8.29283482 19.09536569
  8.25737655  8.29948724  5.18968851  7.57650957  9.57236708  9.30812776
 15.40890745 14.17014422  7.27494202 12.64205099 10.67552906 10.36795475
  4.71994926 15.37176292  9.87939626  9.73487269 13.67495177  3.25676401
 13.45637576  9.1231785  14.96648791 14.17524845  3.38901353  6.30842913
 13.58461933  7.21873599 10.70699131 11.31068463  2.91150479 16.51886462
 14.35204512  8.03027446  9.59282808 12.76576669 16.24021761  4.28321232
 14.36302842  2.99135621 14.82071389  8.84944193  3.32167741 12.65669227]
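
In the run shown above, the IQR rule flagged 48 values while the z-score rule flagged 33. A likely reason is that the quartiles barely move when outliers are present, whereas the mean and standard deviation do. The bounds themselves are easy to check by hand; the sketch below works through a small made-up array, where `iqr_bounds` and `sample` are hypothetical names.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
low, high = iqr_bounds(sample)
# q1 = 3.25, q3 = 7.75, IQR = 4.5, so the fences are -3.5 and 14.5
print(low, high)
print(sample[(sample < low) | (sample > high)])  # [100]
```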

Step 3 – Handle Outliers in the Dataset

There are multiple ways of handling or removing outliers in a dataset. Some of them are as follows:

  1. Trimming
  2. Winsorization
  3. Z-score Method
  4. IQR (Interquartile Range) Method

Method 1: Trimming Method

In this method, we remove a fixed percentage of the data from both extreme ends (the lowest and the highest values). Only the middle portion is kept, on the assumption that the outliers sit in the tails. For example, if you have a list of test scores and some students scored unusually high or low, you remove those extreme scores.

Have a look at the code snippet below.

def trimmingMethod(data, percentage):

    lower_percentile = np.percentile(data, percentage)
    upper_percentile = np.percentile(data, 100 - percentage)

    trimmed_data = data[(data >= lower_percentile) & (data <= upper_percentile)]

    return trimmed_data

trimmed_data = trimmingMethod(original_DATA, 5)

plt.scatter(range(len(trimmed_data)), trimmed_data)
plt.title('Visualization of Data Points after Trimming')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()

The function trimmingMethod removes everything below the given lower percentile and above the matching upper percentile. Here we pass 5, so the lowest 5% and the highest 5% of the values are dropped. We then plot the trimmed data using the same code as before. The resulting scatter plot is shown below.

Trimming - Scatter Plot
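
One thing to keep in mind is that trimming removes the tails whether or not they contain outliers. The sketch below applies the same percentile logic to a made-up array with no outliers at all; `trim` and `clean` are hypothetical names for this illustration.

```python
import numpy as np

def trim(values, percentage):
    """Keep only the values between the given lower and upper percentiles."""
    low = np.percentile(values, percentage)
    high = np.percentile(values, 100 - percentage)
    return values[(values >= low) & (values <= high)]

clean = np.arange(1, 101)  # the integers 1..100, no outliers
trimmed = trim(clean, 5)
print(len(trimmed), trimmed.min(), trimmed.max())  # 90 6 95
```

Ten perfectly ordinary values are discarded, so the percentage is worth choosing with the expected share of outliers in mind.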

Method 2: Winsorization Method

In this method, instead of removing the extreme values on both ends, we replace them with less extreme values. This pulls the outliers closer to the rest of the data, treating them without shrinking the dataset. For instance, if you have a dataset of house prices and some houses are extremely expensive or cheap, you cap their prices at a chosen upper or lower limit to make the dataset more balanced.

The implementation differs from trimming in one respect: after computing the lower and upper bound values, instead of removing the values that fall outside them, we set those values to the bounds themselves.

def winsorizationMethod(data, percentage):
    lower_bound = np.percentile(data, percentage)
    upper_bound = np.percentile(data, 100 - percentage)
    # Work on a copy so original_DATA stays untouched for the later methods
    data = data.copy()
    data[data < lower_bound] = lower_bound
    data[data > upper_bound] = upper_bound
    return data

winsorized_data = winsorizationMethod(original_DATA, 5)

plt.scatter(range(len(winsorized_data)), winsorized_data)
plt.title('Visualization of Data Points after Winsorization')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()

The resulting scatter plot that is displayed on the screen is shown below.

Winsorization Method
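
The same capping can also be written with NumPy's `np.clip`, which returns a new array and leaves its input unchanged. This is a minimal sketch of that alternative, not the code used to produce the plot above; `winsorize_clip` and `sample` are hypothetical names.

```python
import numpy as np

def winsorize_clip(values, percentage):
    """Cap values at the given lower and upper percentiles."""
    low, high = np.percentile(values, [percentage, 100 - percentage])
    # np.clip replaces anything below `low` with `low` and above `high` with `high`
    return np.clip(values, low, high)

sample = np.arange(1, 101, dtype=float)  # 1.0 .. 100.0
capped = winsorize_clip(sample, 5)
print(len(capped), capped.min(), capped.max())
```

Unlike trimming, the result still has all 100 values; only the extremes have been pulled in to the 5th and 95th percentiles.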

Method 3: Z-Score Method

We already used the z-score to detect outliers earlier. Here, the method first detects the outliers and then removes them. For instance, if a data point is too far from the mean (like a person who is unusually tall or short), it’s considered an outlier and removed from the dataset. Note that the code below uses a threshold of 2 rather than 3. Have a look at the code below.

def zScoreMethod(data, threshold=2):
    # Absolute z-score: distance from the mean in standard deviations
    z_scores = np.abs((data - np.mean(data)) / np.std(data))
    # Keep only the points within the threshold
    data_without_outliers = data[z_scores <= threshold]
    return data_without_outliers

zScore_data = zScoreMethod(original_DATA)

plt.scatter(range(len(zScore_data)), zScore_data)
plt.title('Visualization of Data Points after Z Score Method')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()

The resulting scatter plot that is displayed on the screen is shown below.

Z-Score

This method was not as effective as the ones mentioned before because it depends heavily on the threshold value. The mean and standard deviation are themselves inflated by the outliers, so even with a threshold of 2 some outliers remain in the dataset. Hence, it is less reliable for data like ours.
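
The effect is easy to reproduce on a small made-up array. In the sketch below, the values 60 and 200 are both far from the bulk of the data (0 to 19), yet only 200 is removed, because the two extreme values pull the mean and standard deviation upwards. The array `sample` is hypothetical.

```python
import numpy as np

# Twenty ordinary values plus two extreme ones
sample = np.concatenate([np.arange(20), [60, 200]]).astype(float)

mean, std = sample.mean(), sample.std()
threshold = 2
# Largest value that still counts as "normal" under this threshold
upper_cutoff = mean + threshold * std

kept = sample[np.abs((sample - mean) / std) <= threshold]
print(round(upper_cutoff, 1))
print(60.0 in kept, 200.0 in kept)  # True False
```

The cutoff lands at roughly 102, far above the ordinary values, so 60 survives the filter.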

Method 4: IQR (Interquartile Range) Method

The IQR method finds the range where most of the data lies. It computes the interquartile range, the span between the 25th and 75th percentiles that holds the middle 50% of the data points. If a data point falls more than 1.5 times the IQR below the 25th percentile or above the 75th percentile, it’s considered an outlier and removed.

Have a look at the code below.

def IQRMethod(data):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    filtered_data = data[(data >= lower_bound) & (data <= upper_bound)]
    return filtered_data

IQR_data = IQRMethod(original_DATA)

plt.scatter(range(len(IQR_data)), IQR_data)
plt.title('Visualization of Data Points after IQR Method')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()

The resulting scatter plot that is displayed on the screen is shown below.

IQR - Scatter Plot
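
To compare the four methods side by side, it can help to count how many points each one keeps. The sketch below is one way to do that under stated assumptions: it rebuilds similar data with a seeded generator (the seed 0 is arbitrary) and uses hypothetical helper names, so the counts will not match the plots above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for repeatable counts
data = rng.normal(0, 1, 1000)
data[rng.choice(1000, 50, replace=False)] = rng.normal(10, 5, 50)

def keep_mask_percentile(x, p):
    """True for values between the p-th and (100 - p)-th percentiles."""
    lo, hi = np.percentile(x, [p, 100 - p])
    return (x >= lo) & (x <= hi)

def keep_mask_zscore(x, threshold=2):
    """True for values whose absolute z-score is within the threshold."""
    return np.abs((x - x.mean()) / x.std()) <= threshold

def keep_mask_iqr(x, k=1.5):
    """True for values inside the IQR fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

counts = {
    "trimming (5%)": int(keep_mask_percentile(data, 5).sum()),
    "winsorization (5%)": data.size,  # values are capped, none are removed
    "z-score (2)": int(keep_mask_zscore(data).sum()),
    "IQR (1.5)": int(keep_mask_iqr(data).sum()),
}
for name, n in counts.items():
    print(f"{name}: {n} of {data.size} points kept")
```

Trimming always drops a fixed share of the data and winsorization drops nothing, while the z-score and IQR methods remove only the points their rules flag.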

I hope you liked this tutorial. Now, if someone gives you the task of detecting and handling outliers in a dataset, you will pass with flying colors. Thank you for reading the tutorial!

Happy Learning!
