Detect and Handle Outliers using Various Methods in Python
In this tutorial, we will learn how to detect and handle outliers using various methods in Python. Handling outliers is an important step in keeping an analysis robust and reliable. Left untreated, outliers can skew results in a particular direction and lead developers to faulty conclusions.
Step 1 – Data Creation
Before detecting and handling outliers, let’s create some data to work with. We will use the NumPy library to generate random data points and deliberately include outliers by following the steps below:
- Create 1,000 data points from a normal distribution with a mean of 0 and a standard deviation of 1.
- Pick n random index values (here, n = 50) at which we will place the outlier values.
- Replace the values at those indices with points drawn from a normal distribution with a mean of 10 and a standard deviation of 5, most of which fall outside the range of the original data.
import numpy as np

original_DATA = np.random.normal(loc=0, scale=1, size=1000)
outliers__INDEXING = np.random.choice(1000, size=50, replace=False)
original_DATA[outliers__INDEXING] = np.random.normal(loc=10, scale=5, size=50)
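Because the data is random, the exact numbers change on every run. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator; the seed value 42 and the names `rng`, `seeded_data`, and `seeded_outlier_idx` are illustrative choices, not part of the original snippet.

```python
import numpy as np

# Assumption: 42 is an arbitrary seed; any fixed integer gives repeatable data.
rng = np.random.default_rng(42)

# 1,000 points from a standard normal distribution
seeded_data = rng.normal(loc=0, scale=1, size=1000)

# 50 distinct positions that will receive outlier values
seeded_outlier_idx = rng.choice(1000, size=50, replace=False)

# Overwrite those positions with draws from N(mean=10, std=5)
seeded_data[seeded_outlier_idx] = rng.normal(loc=10, scale=5, size=50)

print(seeded_data.shape)
print(len(np.unique(seeded_outlier_idx)))
```

Running this twice with the same seed should produce identical arrays, which makes the detection results easier to compare between runs.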
After creating the data, let’s visualize it with the Matplotlib library, using a simple scatter plot to make the outliers visible. Have a look at the code snippet below.
import matplotlib.pyplot as plt

plt.scatter(range(len(original_DATA)), original_DATA)
plt.title('Visualization of Data Points with Outliers')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()
The resulting scatter plot we get is shown below:
You can see that many points lie far away from the main cluster of data. These points are the outliers. We aim to detect and remove them in the upcoming sections.
Step 2 – Detect Outliers in the Dataset
There are multiple ways of detecting outliers in a dataset. Some of them are as follows:
- Visual Inspection (Scatter Plot/Box Plot)
- Z-Score Method
- Interquartile Range (IQR) Method
Visual Inspection (Scatter Plot/Box Plot)
Earlier, we looked at a scatter plot, which helped us see the outlier points in the dataset. Another plot we can use is the box plot. If you are unfamiliar with how box plots work, have a look at the tutorial below.
Also Read: Understanding Python pandas.DataFrame.boxplot
Have a look at the code snippet below, which draws the box plot for our data so we can see what it tells us about the outliers.
plt.boxplot(original_DATA)
plt.title('Box Plot of original_DATA')
plt.ylabel('Values')
plt.show()
The resulting box plot we get is shown below. The data points that fall beyond the whiskers of the box plots are the outliers for this data.
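To read the whisker positions as numbers rather than off the plot, one option is Matplotlib's `cbook.boxplot_stats` helper, which computes the statistics a box plot draws. The sketch below is only an illustration on a small hand-made array (the name `sample` and its values are assumptions, not the tutorial data). By default the whiskers extend to the furthest data point within 1.5 times the IQR of the box.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical toy data: nine ordinary values and one extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(sample)[0]

print("lower whisker:", stats["whislo"])
print("upper whisker:", stats["whishi"])
print("points beyond the whiskers:", stats["fliers"])
```

Here the value 100 is the only point reported beyond the whiskers.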
Z-Score Method
For this method, we decide on a threshold value (here we use 3, but you can choose another value). A point’s z-score is its distance from the mean, measured in standard deviations, and points whose absolute z-score exceeds the threshold are treated as outliers. To compute z-scores, we will use the zscore function from the scipy.stats module, as shown in the code below.
from scipy.stats import zscore

ZScores = zscore(original_DATA)
outliersZScores = original_DATA[np.abs(ZScores) > 3]
print("Outliers identified using Z-score method:", outliersZScores)
The result of the code is:
Outliers identified using Z-score method: [ 9.86350139 15.35404599 8.29283482 19.09536569 8.25737655 8.29948724 9.57236708 9.30812776 15.40890745 14.17014422 12.64205099 10.67552906 10.36795475 15.37176292 9.87939626 9.73487269 13.67495177 13.45637576 9.1231785 14.96648791 14.17524845 13.58461933 10.70699131 11.31068463 16.51886462 14.35204512 9.59282808 12.76576669 16.24021761 14.36302842 14.82071389 8.84944193 12.65669227]
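To make the z-score itself concrete, here is a small sketch that computes it by hand with NumPy as (value - mean) / standard deviation, using the population standard deviation, which matches the default of scipy's `zscore`. The helper name `manual_zscore` and the toy array are assumptions made for illustration.

```python
import numpy as np

def manual_zscore(values):
    """Return (x - mean) / std for each value, using the population std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Toy data with mean 5 and standard deviation 2
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = manual_zscore(sample)

print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

The value 9 sits two standard deviations above the mean, so it would be flagged with a threshold below 2 but not with a threshold of 3.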
Interquartile Range (IQR) Method
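This method flags any point that lies more than 1.5 times the IQR (the distance between the 25th and 75th percentiles) below the first quartile or above the third quartile. If you are unfamiliar with this approach, have a look at the tutorial mentioned below. For now, we implement the method directly in the code snippet below.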
Also Read: Outlier Detection from Inter-Quartile Range in Machine Learning | Python
q1 = np.percentile(original_DATA, 25)
q3 = np.percentile(original_DATA, 75)
IQR = q3 - q1
lower_bound = q1 - 1.5 * IQR
upper_bound = q3 + 1.5 * IQR
outliersIQR = original_DATA[(original_DATA < lower_bound) | (original_DATA > upper_bound)]
print("Outliers identified using IQR method:", outliersIQR)
The result of the code is:
Outliers identified using IQR method: [ 4.91446421 9.86350139 15.35404599 4.56582595 8.29283482 19.09536569 8.25737655 8.29948724 5.18968851 7.57650957 9.57236708 9.30812776 15.40890745 14.17014422 7.27494202 12.64205099 10.67552906 10.36795475 4.71994926 15.37176292 9.87939626 9.73487269 13.67495177 3.25676401 13.45637576 9.1231785 14.96648791 14.17524845 3.38901353 6.30842913 13.58461933 7.21873599 10.70699131 11.31068463 2.91150479 16.51886462 14.35204512 8.03027446 9.59282808 12.76576669 16.24021761 4.28321232 14.36302842 2.99135621 14.82071389 8.84944193 3.32167741 12.65669227]
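The arithmetic behind the bounds is easier to follow on a tiny example. The sketch below applies the same 1.5 × IQR rule to a hand-made array; the names `sample`, `lower_fence`, `upper_fence`, and `flagged` are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: nine ordinary values and one extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(sample, [25, 75])   # 3.25 and 7.75
iqr = q3 - q1                              # 4.5

lower_fence = q1 - 1.5 * iqr               # 3.25 - 6.75 = -3.5
upper_fence = q3 + 1.5 * iqr               # 7.75 + 6.75 = 14.5

flagged = sample[(sample < lower_fence) | (sample > upper_fence)]
print(lower_fence, upper_fence, flagged)
```

Only the value 100 falls outside the interval from -3.5 to 14.5, so it is the only point flagged.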
Step 3 – Handle Outliers in the Dataset
There are multiple ways of handling or removing outliers in a dataset. Some of them are as follows:
- Trimming
- Winsorization
- Z-score Method
- IQR (Interquartile Range) Method
Method 1: Trimming Method
In this method, we remove a fixed percentage of the data from both extreme ends (the lowest and the highest values) and keep only the values in between. For example, if you have a list of test scores and some students scored unusually high or low, you remove those extreme scores.
Have a look at the code snippet below.
def trimmingMethod(data, percentage):
    lower_percentile = np.percentile(data, percentage)
    upper_percentile = np.percentile(data, 100 - percentage)
    trimmed_data = data[(data >= lower_percentile) & (data <= upper_percentile)]
    return trimmed_data

trimmed_data = trimmingMethod(original_DATA, 5)

plt.scatter(range(len(trimmed_data)), trimmed_data)
plt.title('Visualization of Data Points after Trimming')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()
In the function trimmingMethod, we trim the original data by removing the values below the lower percentile and above the upper percentile (here, the 5th and 95th). We also plot the trimmed data using the same plotting code as before. The resulting scatter plot that is displayed on the screen is shown below.
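One thing worth noticing about percentile trimming is that it always removes a fixed share of the points from each end, whether or not those points are outliers. A minimal sketch on evenly spaced toy data illustrates this; the helper name `trim_by_percentile` and the array `sample` are assumptions for illustration.

```python
import numpy as np

def trim_by_percentile(data, percentage):
    """Keep only values between the given lower and upper percentiles."""
    lower = np.percentile(data, percentage)
    upper = np.percentile(data, 100 - percentage)
    return data[(data >= lower) & (data <= upper)]

# Toy data: the values 0, 1, ..., 99 with no outliers at all
sample = np.arange(100, dtype=float)
trimmed = trim_by_percentile(sample, 5)

print(len(sample), len(trimmed))     # 100 90
print(trimmed.min(), trimmed.max())  # 5.0 94.0
```

Even though this toy data contains no outliers, five values are dropped from each end, so the percentage is best chosen with the expected share of outliers in mind.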
Method 2: Winsorization Method
In this method, instead of removing the extreme values on both ends, we replace them with less extreme values. This pulls the outliers closer to the rest of the data while keeping the number of data points unchanged. For instance, if you have a dataset of house prices and some houses are extremely expensive or cheap, you cap their prices at a chosen upper or lower limit to make the dataset more balanced.
There is one difference between the implementations of Trimming and Winsorization: after computing the lower and upper bound values, instead of removing the values outside them, we set those values to the bounds themselves.
def winsorizationMethod(data, percentage):
    lower_bound = np.percentile(data, percentage)
    upper_bound = np.percentile(data, 100 - percentage)
    # Work on a copy so that original_DATA stays unchanged for the later methods
    winsorized = data.copy()
    winsorized[winsorized < lower_bound] = lower_bound
    winsorized[winsorized > upper_bound] = upper_bound
    return winsorized

winsorized_data = winsorizationMethod(original_DATA, 5)

plt.scatter(range(len(winsorized_data)), winsorized_data)
plt.title('Visualization of Data Points after Winsorization')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()
The resulting scatter plot that is displayed on the screen is shown below.
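The same capping can also be written with NumPy's `np.clip`, which limits every value to a given interval and returns a new array. The sketch below is one possible shorthand, not the tutorial's own function; the name `winsorize_with_clip` and the toy array are assumptions.

```python
import numpy as np

def winsorize_with_clip(data, percentage):
    """Cap values at the given lower and upper percentiles."""
    lower, upper = np.percentile(data, [percentage, 100 - percentage])
    # np.clip returns a new array, so the input is not modified
    return np.clip(data, lower, upper)

# Toy data: the values 0, 1, ..., 99
sample = np.arange(100, dtype=float)
clipped = winsorize_with_clip(sample, 5)

print(len(sample), len(clipped))     # same number of points
print(clipped.min(), clipped.max())  # extremes pulled in to the percentiles
```

Unlike trimming, the result has the same length as the input; only the most extreme values are changed.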
Method 3: Z-Score Method
This method builds on the z-score detection we covered earlier: it first detects the outliers and then removes them. For instance, if a data point is too far away from the mean (like a person who is unusually tall or short), it’s considered an outlier and removed from the dataset. Have a look at the code below.
def zScoreMethod(data, threshold=2):
    # Absolute z-score: distance from the mean in standard deviations
    z_scores = np.abs((data - np.mean(data)) / np.std(data))
    # Keep only the points whose z-score is within the threshold
    data_without_outliers = data[z_scores <= threshold]
    return data_without_outliers

zScore_data = zScoreMethod(original_DATA)

plt.scatter(range(len(zScore_data)), zScore_data)
plt.title('Visualization of Data Points after Z Score Method')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()
The resulting scatter plot that is displayed on the screen is shown below.
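This method was less effective here than the ones mentioned before because it depends heavily on the threshold value: even with a threshold of 2, there are still outliers present in the dataset. One reason is that the outliers themselves inflate the mean and standard deviation used to compute the z-scores, which widens the range of values that are kept.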
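The effect of the threshold can be checked directly by counting how many points each threshold would remove. The sketch below regenerates data in the same way as Step 1, but with a seeded generator so the run is repeatable; the seed 0 and the names `demo` and `removed_counts` are assumptions for illustration.

```python
import numpy as np

# Assumption: seed 0 is arbitrary and only makes the run repeatable
rng = np.random.default_rng(0)
demo = rng.normal(loc=0, scale=1, size=1000)
outlier_idx = rng.choice(1000, size=50, replace=False)
demo[outlier_idx] = rng.normal(loc=10, scale=5, size=50)

# The outliers pull the mean away from 0 and push the std above 1
print("mean:", demo.mean(), "std:", demo.std())

z = np.abs((demo - demo.mean()) / demo.std())

# Number of points each threshold would remove
removed_counts = {t: int(np.sum(z > t)) for t in (2.0, 2.5, 3.0)}
print(removed_counts)
```

A larger threshold can never remove more points than a smaller one, so the counts shrink (or stay equal) as the threshold grows.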
Method 4: IQR (Interquartile Range) Method
The IQR method finds the range in which the middle 50% of the data points lie, that is, the range between the 25th and 75th percentiles. If a data point falls more than 1.5 times the IQR below the 25th percentile or above the 75th percentile, it’s considered an outlier and removed.
Have a look at the code below.
def IQRMethod(data):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    filtered_data = data[(data >= lower_bound) & (data <= upper_bound)]
    return filtered_data

IQR_data = IQRMethod(original_DATA)

plt.scatter(range(len(IQR_data)), IQR_data)
plt.title('Visualization of Data Points after IQR Method')
plt.xlabel('Indexing of Data Points')
plt.ylabel('Value of Data Points')
plt.show()
The resulting scatter plot that is displayed on the screen is shown below.
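To compare the four handling methods side by side without plotting, one option is to print how many points each one keeps and the largest value that survives. The sketch below re-implements the methods in compact form on seeded data; the seed 1 and the helper names (`keep_trim`, `keep_winsor`, `keep_zscore`, `keep_iqr`) are assumptions for illustration.

```python
import numpy as np

# Assumption: seed 1 is arbitrary; data is built the same way as in Step 1
rng = np.random.default_rng(1)
demo = rng.normal(loc=0, scale=1, size=1000)
outlier_idx = rng.choice(1000, size=50, replace=False)
demo[outlier_idx] = rng.normal(loc=10, scale=5, size=50)

def keep_trim(data, pct=5):
    lo, hi = np.percentile(data, [pct, 100 - pct])
    return data[(data >= lo) & (data <= hi)]

def keep_winsor(data, pct=5):
    lo, hi = np.percentile(data, [pct, 100 - pct])
    return np.clip(data, lo, hi)  # caps values, keeps every point

def keep_zscore(data, threshold=2):
    z = np.abs((data - data.mean()) / data.std())
    return data[z <= threshold]

def keep_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]

results = {
    "trimming": keep_trim(demo),
    "winsorization": keep_winsor(demo),
    "z-score": keep_zscore(demo),
    "iqr": keep_iqr(demo),
}

for name, kept in results.items():
    print(f"{name:14s} kept={len(kept):4d}  max={kept.max():.2f}")
```

Winsorization always keeps every point, while the other three methods return fewer points than they were given whenever something is removed.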
I hope you liked this tutorial. Now, if someone gives you the task of detecting and handling outliers in a dataset, you will pass it with flying colors. Thank you for reading the tutorial!
Happy Learning!