Identify Skewness in Box Plots in Python

Hey fellow Python coder! In this tutorial, we will learn what skewness is and how to identify it in box plots using Python. Let’s start by understanding what we mean by skewness.

In general, skewness refers to asymmetry (a lack of symmetry) in the distribution of data. To understand the concept in terms of box plots, have a look at the illustration below.

Skewness in Box Plots

You can see the difference between positive and negative skewness. With positive skewness, the data is more spread out to the right of the median line: the right half of the box is wider and the right whisker is longer. With negative skewness, the case is the opposite, and the data stretches further to the left of the median. You will get a better understanding once we go through the code implementation.
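
One way to put a number on what the illustration shows is the quartile (Bowley) skewness coefficient, which compares how far the upper and lower quartiles sit from the median. The sketch below is only an illustration: the helper name quartile_skewness and the tiny sample lists are made up for this example and are not part of the tutorial's dataset.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical samples chosen so the quartiles are easy to read off
symmetric = [1, 2, 3, 4, 5]     # quartiles 2, 3, 4
right_tail = [1, 2, 3, 6, 10]   # quartiles 2, 3, 6
left_tail = [1, 5, 8, 9, 10]    # quartiles 5, 8, 9

print(quartile_skewness(symmetric))   # 0.0
print(quartile_skewness(right_tail))  # 0.5
print(quartile_skewness(left_tail))   # -0.5
```

A value near 0 suggests the box is roughly balanced around the median, a positive value means the upper half of the box is wider, and a negative value means the lower half is wider.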

For the code implementation, to make things interesting, let’s visualize the strength/power levels of superheroes in three categories, each demonstrating a different type of skewness. The categories are ‘Powerhouse’ (one with a great level of power), ‘Tech Genius’ (the tech power source of the team), and ‘Speedster’ (one whose power comes from speed), which will represent no, positive, and negative skewness respectively.

Python Code Implementation for No Skewness (Normal Distribution)

To represent no skewness, we will take the superheroes belonging to the category ‘Powerhouse’. We will start by creating data and then plotting the boxplot using the matplotlib library as displayed in the code below. We will also add labels and titles to the plot to make the visualization more appealing.

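import numpy as np
import matplotlib.pyplot as plt

# 100 power levels drawn from a normal distribution (mean 50000, std 10000)
data1 = np.random.normal(50000, 10000, 100)

plt.figure(figsize=(8,4))
# vert=False draws the box horizontally, so the values lie along the x-axis
plt.boxplot([data1], labels=['PowerHouse'], vert=False)
plt.title('PowerHouse Superheroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()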

The code follows these steps:

  1. Import the NumPy and Matplotlib libraries.
  2. Create a random dataset using the np.random.normal function.
  3. Plot the dataset as a box plot using the boxplot function, and add a title and axis labels using the title, xlabel, and ylabel functions.
  4. To match the illustration, make the plot horizontal by setting the vert parameter to False.
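
Because np.random.normal draws fresh values on every run, the plot will look slightly different each time. If you want repeatable results, one option is a seeded generator, sketched below. The function name make_powerhouse_data and the seed value 42 are arbitrary choices for illustration and are not part of the original code.

```python
import numpy as np

def make_powerhouse_data(seed=42, size=100):
    # A seeded Generator returns the same sequence for the same seed
    rng = np.random.default_rng(seed)
    return rng.normal(50000, 10000, size)

first_run = make_powerhouse_data()
second_run = make_powerhouse_data()
print(np.array_equal(first_run, second_run))  # True
```

The returned array can be passed to plt.boxplot in place of data1.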

The output of the code looks like this:

Normal Distribution - No Skewness - Box Plot Output

You can see that in this case, the data is roughly symmetric around the median: the two halves of the box and the two whiskers are of similar length.

Python Code Implementation for Positive Skewness

We follow the same procedure here; the only difference is the dataset. To get a tail of larger values, we combine 80 values drawn around 50000 with 20 values drawn from a much wider distribution centered at 150000 (with a standard deviation of 50000).

Have a look at the code below:

import numpy as np
import matplotlib.pyplot as plt

# 80 typical values around 50000 plus 20 much larger values around 150000
data2 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(150000, 50000, 20)])

plt.figure(figsize=(8,4))
plt.boxplot([data2], labels=['Tech Genius'], vert=False)
plt.title('Tech Genius Superheroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()

The output of the code looks like this:

Positive Skewness - Box Plot Output

We observe that in this case, the data extends much further to the right of the median (the large values typically appear as a long right whisker or as outlier points), which signifies that it’s positively skewed.
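
By default, Matplotlib's boxplot extends each whisker to the most extreme point within 1.5 times the interquartile range (IQR) of the box and marks anything beyond that as an outlier. As a rough sketch, we can count such points on each side for a dataset built the same way as data2. The fixed seed and the helper name count_outliers are assumptions added for this example.

```python
import numpy as np

def count_outliers(values, whis=1.5):
    """Count points below and above the whis * IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low = int(np.sum(values < q1 - whis * iqr))
    high = int(np.sum(values > q3 + whis * iqr))
    return low, high

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
tech_genius = np.concatenate([rng.normal(50000, 10000, 80),
                              rng.normal(150000, 50000, 20)])

n_low, n_high = count_outliers(tech_genius)
print(f"outliers below the box: {n_low}, above the box: {n_high}")
```

Far more outliers above the box than below it is another sign of positive skew.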

Python Code Implementation for Negative Skewness

We follow the same procedure again, changing only the dataset. This time we need a tail of smaller values, so we combine 80 values drawn around 50000 with 20 values drawn around 20000, as shown in the code below:

import numpy as np
import matplotlib.pyplot as plt

# 80 typical values around 50000 plus 20 smaller values around 20000
data3 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(20000, 5000, 20)])

plt.figure(figsize=(8,4))
plt.boxplot([data3], labels=['Speedster'], vert=False)
plt.title('Speedster Superheroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()

The output of the code looks like this:

Negative Skewness - Box Plot Output

We observe that in this case, the data extends further to the left of the median, toward the smaller values, which signifies that it’s negatively skewed.
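
Another common check is to compare the mean with the median and to compute the moment-based sample skewness, which is negative when the tail lies on the left. The sketch below uses the same 80/20 mix as data3, but with 100 times as many points and a fixed seed so that the numbers are stable. Both of these choices, and the helper name sample_skewness, are assumptions added for this example.

```python
import numpy as np

def sample_skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
# Same proportions as data3, scaled up from 80/20 to 8000/2000 points
speedster = np.concatenate([rng.normal(50000, 10000, 8000),
                            rng.normal(20000, 5000, 2000)])

print("mean < median:", speedster.mean() < np.median(speedster))
print("skewness:", sample_skewness(speedster))
```

With only 100 points, as in the tutorial's data3, these numbers can fluctuate noticeably from run to run.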

Visualize All Skewness Types in One Plot

To visualize the plots next to each other, we will use side-by-side box plots, as in the code snippet below. If you are unfamiliar with side-by-side box plots, have a look at the tutorial below.

Also Read: Side by side Boxplots in Python

import numpy as np
import matplotlib.pyplot as plt

# No skewness: a single normal distribution
data1 = np.random.normal(50000, 10000, 100)
# Positive skewness: an extra cluster of larger values
data2 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(150000, 50000, 20)])
# Negative skewness: an extra cluster of smaller values
data3 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(20000, 5000, 20)])
combined_data = [data1, data2, data3]

plt.figure(figsize=(6, 4))
plt.boxplot(combined_data, labels=['PowerHouse(0)', 'Tech Genius(+ve)', 'Speedster(-ve)'], vert=False)
plt.title('Superheroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()

The final plot looks like this:

All Skewness Types - Box Plot Output
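
If you would rather read the numbers behind each box than judge them by eye, Matplotlib exposes the statistics it computes for box plots through matplotlib.cbook.boxplot_stats. The sketch below regenerates the three datasets with a fixed seed (an assumption added for reproducibility) and prints the quartiles, whisker ends, and outlier count for each category.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
data1 = rng.normal(50000, 10000, 100)
data2 = np.concatenate([rng.normal(50000, 10000, 80),
                        rng.normal(150000, 50000, 20)])
data3 = np.concatenate([rng.normal(50000, 10000, 80),
                        rng.normal(20000, 5000, 20)])

names = ['PowerHouse', 'Tech Genius', 'Speedster']
# One dict per dataset, computed with the default whisker factor of 1.5
stats = cbook.boxplot_stats([data1, data2, data3])

for name, s in zip(names, stats):
    print(f"{name}: q1={s['q1']:.0f}, median={s['med']:.0f}, q3={s['q3']:.0f}, "
          f"whiskers=({s['whislo']:.0f}, {s['whishi']:.0f}), "
          f"outliers={len(s['fliers'])}")
```

Comparing the distance from the median to each quartile and whisker end, along with the outlier counts, gives a numeric counterpart to the visual comparison above.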
