Detect and exclude outliers in a pandas DataFrame in Python

In this tutorial, we will look at how to detect and exclude outliers in a pandas DataFrame in Python.

What are Outliers?

Outliers are data points that differ markedly from the rest of the dataset. These anomalous points generally fall far outside the pattern formed by the other data. They can distort the analysis, the interpretation, and sometimes the modeling of the data.

Hence, we often need to remove them from our dataset before analysis.
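To make the distortion concrete, here is a minimal sketch with made-up numbers (the values are illustrative assumptions, not real data). It compares the mean and the median of a small series with and without one extreme value.

```python
import pandas as pd

# Illustrative values only: five ordinary points plus one extreme point.
values = pd.Series([1, 2, 3, 4, 5, 100])
without_extreme = values[values < 100]

mean_with = values.mean()
mean_without = without_extreme.mean()
median_with = values.median()
median_without = without_extreme.median()

print(f"mean with extreme value:      {mean_with:.2f}")
print(f"mean without extreme value:   {mean_without:.2f}")
print(f"median with extreme value:    {median_with:.2f}")
print(f"median without extreme value: {median_without:.2f}")
```

The mean moves a lot when the extreme value is present, while the median barely changes, which is why a single outlier can mislead any analysis built on averages.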

Let’s first see how to detect and exclude them.

Detecting and Excluding Outliers

This can be done in many ways:

  • Z-Score Method
  • Interquartile Range (IQR) Method
  • Visualization Techniques
  • One-Class SVM (Support Vector Machine)
  • Mahalanobis Distance
  • Local Outlier Factor (LOF)

And many more.

Let’s look into the implementation part.

1. Z-Score Method – Implementation

It is also known as the standard score method. A z-score measures how many standard deviations a data point lies from the mean, and points with a large absolute z-score are treated as outliers. Let’s see the coding part.
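As a reference for what the code computes, the z-score can be written as below. The symbols are our own notation, and the population form of the standard deviation (dividing by n) is chosen because it matches the default of scipy.stats.zscore (ddof=0).

```latex
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}
```

For example, for the six values 1, 2, 3, 4, 5, 100 we get μ = 115/6 ≈ 19.17 and σ ≈ 36.17, so the z-score of 100 is (100 − 19.17) / 36.17 ≈ 2.23.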

For this, we first need to install the required libraries (the leading ! runs the command from a notebook cell).

!pip install pandas scipy

Now, we will import them.

import pandas as pd
from scipy import stats

Here we are importing pandas so that we can build and work with our dataframe, and scipy.stats for its zscore() function.

Before moving forward, let’s build a sample dataframe with outliers.

data = {'Column1': [1, 2, 3, 4, 5, 100],
        'Column2': [10, 20, 30, 40, 50, 200]}
df = pd.DataFrame(data)

Here we can see that in column 1 ‘100’ is the outlier and in column 2 ‘200’ is the outlier.

Let’s write the code to exclude them using the Z-Score method.

z_scores = stats.zscore(df)
print(z_scores)

Here, we are using stats.zscore() to calculate the z-score of each value relative to its own column. For the last row the z-scores are roughly 2.23 (Column1) and 2.19 (Column2), while every other value has an absolute z-score below 1.

So the last row clearly stands out. Now we will classify each row as either an outlier or not an outlier. Let’s see the code:

# Flag a row only if every column's |z-score| exceeds the threshold
threshold = 2
outliers = (abs(z_scores) > threshold).all(axis=1)
print(outliers)

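Here, we set the threshold to 2 and compare the absolute value of every z-score with it. Because of .all(axis=1), a row is marked ‘True’ only when all of its columns exceed the threshold; otherwise it is marked ‘False’. Use .any(axis=1) instead if a single extreme column should be enough to flag a row. The more common threshold of 3 would flag nothing here: with six rows and the population standard deviation that stats.zscore() uses by default, no z-score can exceed √5 ≈ 2.24.

The result is False for the first five rows and True for the last row. So, now we will exclude the outliers by dropping that row and displaying the final dataset, free from outliers. Let’s see the code: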

df_no_outliers = df[~outliers]
df_no_outliers

Output:

   Column1  Column2
0        1       10
1        2       20
2        3       30
3        4       40
4        5       50
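The steps above can be collected into one reusable function. The sketch below is a hypothetical helper (drop_zscore_outliers and its how parameter are our own names, not part of pandas or SciPy). It computes the z-scores with pandas alone, using ddof=0 so the values agree with the default of stats.zscore().

```python
import pandas as pd


def drop_zscore_outliers(df: pd.DataFrame, threshold: float = 2.0,
                         how: str = "all") -> pd.DataFrame:
    """Return df without the rows flagged as outliers by column-wise z-scores.

    Hypothetical helper for illustration; not part of pandas or SciPy.
    how="all" flags a row only if every column exceeds the threshold,
    how="any" flags it if at least one column does.
    """
    if how not in ("all", "any"):
        raise ValueError("how must be 'all' or 'any'")
    # Population standard deviation (ddof=0), as in scipy.stats.zscore's default.
    z_scores = (df - df.mean()) / df.std(ddof=0)
    exceeds = z_scores.abs() > threshold
    flagged = exceeds.all(axis=1) if how == "all" else exceeds.any(axis=1)
    return df[~flagged]


data = {'Column1': [1, 2, 3, 4, 5, 100],
        'Column2': [10, 20, 30, 40, 50, 200]}
df = pd.DataFrame(data)
cleaned = drop_zscore_outliers(df, threshold=2)
print(cleaned)

# A row that is extreme in only one column is kept by how="all"
# but dropped by how="any".
mixed = pd.DataFrame({'Column1': [1, 2, 3, 4, 5, 100],
                      'Column2': [10, 20, 30, 40, 50, 60]})
kept_all = drop_zscore_outliers(mixed, how="all")
kept_any = drop_zscore_outliers(mixed, how="any")
print(len(kept_all), len(kept_any))
```

Which mode is appropriate depends on the data: how="all" is the behaviour of the code in this tutorial, while how="any" is stricter and removes a row as soon as one of its values is extreme.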

2. Interquartile Range (IQR) Method – Implementation

For a detailed walkthrough, see the article: Outlier detection from Inter-Quartile Range in Machine Learning | Python
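As a quick illustration of the idea, here is a minimal sketch of the IQR method applied to the same sample dataframe. It assumes the conventional 1.5 × IQR fences and drops a row if any of its columns falls outside them; both choices are assumptions of this sketch and can be adjusted.

```python
import pandas as pd

data = {'Column1': [1, 2, 3, 4, 5, 100],
        'Column2': [10, 20, 30, 40, 50, 200]}
df = pd.DataFrame(data)

# First and third quartiles of each column.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3 (an assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True wherever a value lies outside its column's fences.
outside = (df < lower) | (df > upper)

# Keep only the rows with no value outside the fences.
df_iqr = df[~outside.any(axis=1)]
print(df_iqr)
```

Unlike the z-score, the quartiles are barely affected by the extreme values themselves, so this method still works on very small samples.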

