Identify Skewness in Box Plots in Python
Hey fellow Python coder! In this tutorial, we will learn what skewness is and how to identify it in box plots using Python. Let’s start by understanding what we mean by skewness.
Also Read: How to Box plot visualization with Pandas and Seaborn
In general, skewness refers to asymmetry (a lack of symmetry) in data. To understand the concept in terms of box plots, have a look at the illustration below.
You can see the difference between positive and negative skewness. With positive skewness, most of the data points are bunched up at the lower end and a long tail stretches out to the right of the median line, so the right whisker and the right half of the box are longer than the left ones. With negative skewness the case is the opposite, with the long tail on the left. You will get a better understanding once we go through the code implementation.
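To put a number on what the box shows, one option is the quartile (Bowley) skewness, which compares the two halves of the box around the median. The snippet below is a minimal sketch, not part of the original tutorial: the function name quartile_skewness and the small sample lists are made up for illustration. A value near 0 suggests a symmetric box, a positive value a wider upper half, and a negative value a wider lower half.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley (quartile) skewness: compares the two halves of the box."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    if iqr == 0:
        # All of the middle 50% is a single value, so the box has no width.
        return 0.0
    return float(((q3 - median) - (median - q1)) / iqr)

# Hypothetical toy samples, chosen only to illustrate the three cases.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_tailed = [1, 2, 3, 4, 5, 10, 20, 40, 80]
left_tailed = [-80, -40, -20, -10, -5, -4, -3, -2, -1]

for name, sample in [("symmetric", symmetric),
                     ("right_tailed", right_tailed),
                     ("left_tailed", left_tailed)]:
    print(name, round(quartile_skewness(sample), 3))
```

This measure only looks at the box itself, so it ignores how far the whiskers and outliers reach.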
For the code implementation, to make things interesting, let’s visualize the strength/power levels of superheroes in three categories, where each category demonstrates a different type of skewness. The categories I will be considering are ‘Powerhouse’ (one with a great level of power), ‘Tech Genius’ (the tech power source of the team), and ‘Speedster’ (the one where speed is key to their power), which will represent no, positive, and negative skewness respectively.
Also Read: Understanding Python pandas.DataFrame.boxplot
Code Implementation for Normal Distribution
To represent no skewness, we will take the superheroes belonging to the category ‘Powerhouse’. We will start by creating the data and then plot the box plot using the matplotlib library, as shown in the code below. We will also add labels and a title to make the visualization more appealing.
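import numpy as np
import matplotlib.pyplot as plt

# 100 power levels drawn from a normal distribution (mean 50000, std 10000)
data1 = np.random.normal(50000, 10000, 100)

plt.figure(figsize=(8,4))
# vert=False draws the box horizontally, so the values run along the x-axis
plt.boxplot([data1], labels=['PowerHouse'], vert=False)
plt.title('PowerHouse SuperHeroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()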
We follow these steps:
- Import the NumPy and Matplotlib libraries.
- Create a random dataset using the random.normal function.
- Plot the dataset as a box plot with the help of the boxplot function and add some extra content to the visualization using the xlabel, ylabel and title functions.
- Also, to match the illustration, make the visualization horizontal by setting the vert parameter to False.
The output of the code looks like this:
You can see that in this case the data is distributed pretty much symmetrically: the median line sits near the middle of the box and the two whiskers are about the same length.
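If you would rather check the whiskers with numbers than by eye, the sketch below measures how far each whisker reaches from the box. It assumes the common 1.5 × IQR whisker rule (Matplotlib's default whis value); the helper name whisker_lengths and the seed are made up for illustration. For a roughly symmetric sample the two lengths should come out similar.

```python
import numpy as np

def whisker_lengths(values, whis=1.5):
    """Return (left, right) whisker lengths under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Whiskers end at the most extreme data points still inside the fences.
    inside = values[(values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)]
    # Clamp at 0 so a whisker never ends up inside the box.
    left = max(0.0, float(q1 - inside.min()))
    right = max(0.0, float(inside.max() - q3))
    return left, right

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatable output
powerhouse = rng.normal(50000, 10000, 100)
left, right = whisker_lengths(powerhouse)
print(f"left whisker: {left:.0f}, right whisker: {right:.0f}")
```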
Python Code Implementation for Positive Skewness
We will follow the same procedure for this case, the only exception being that the dataset will have a different set of data points. As we need a long tail of larger values, we combine 80 values drawn around 50000 with 20 values drawn from a much wider range centered on 150000.
Have a look at the code below:
import numpy as np
import matplotlib.pyplot as plt

# 80 typical values plus 20 much larger ones that form the right tail
data2 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(150000, 50000, 20)])

plt.figure(figsize=(8,4))
plt.boxplot([data2], labels=['Tech Genius'], vert=False)
plt.title('Tech Genius SuperHeroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()
The output of the code looks like this:
We observe that in this case the plot stretches much further to the right of the median than to the left, with the larger values showing up as a long right whisker and outliers, which signifies that the data is positively skewed.
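A quick numeric cross-check for positive skewness is to compare the mean with the median and to compute the moment-based skewness coefficient. The snippet below is a rough sketch using only NumPy; the function name moment_skewness and the seed are assumptions for illustration, and the dataset is built the same way as data2 above. For positively skewed data we would expect the mean to sit above the median and the coefficient to be greater than 0.

```python
import numpy as np

def moment_skewness(values):
    """Third standardized moment: m3 / m2 ** 1.5 (population form)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    if m2 == 0:
        # Every value is identical, so there is no spread and no skew.
        return 0.0
    return float(m3 / m2 ** 1.5)

rng = np.random.default_rng(1)  # arbitrary seed, only for repeatable output
tech_genius = np.concatenate([rng.normal(50000, 10000, 80),
                              rng.normal(150000, 50000, 20)])

print(f"mean:     {tech_genius.mean():.0f}")
print(f"median:   {np.median(tech_genius):.0f}")
print(f"skewness: {moment_skewness(tech_genius):.2f}")
```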
Python Code Implementation for Negative Skewness
We will follow the same procedure for this case as well, the only exception again being the dataset. Here we need a tail of smaller values, so we combine 80 values drawn around 50000 with 20 values drawn around 20000, as shown in the code below:
import numpy as np
import matplotlib.pyplot as plt

# 80 typical values plus 20 much smaller ones that form the left tail
data3 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(20000, 5000, 20)])

plt.figure(figsize=(8,4))
plt.boxplot([data3], labels=['Speedster'], vert=False)
plt.title('Speedster SuperHeroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()
The output of the code looks like this:
We observe that in this case the plot stretches further to the left of the median than to the right, with the smaller values forming a long left tail, which signifies that the data is negatively skewed.
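Another hint you can read off a box plot is which side the outlier points fall on. The sketch below counts the points beyond each whisker fence, again assuming the 1.5 × IQR rule; the helper name outlier_counts, the toy list and the seed are hypothetical. More points on the low side than on the high side points towards a left tail.

```python
import numpy as np

def outlier_counts(values, whis=1.5):
    """Count points below the lower fence and above the upper fence."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low = int(np.sum(values < q1 - whis * iqr))
    high = int(np.sum(values > q3 + whis * iqr))
    return low, high

# Toy sample with a single very small value on the left.
print(outlier_counts([1, 50, 51, 52, 53, 54, 55, 56, 57]))

rng = np.random.default_rng(2)  # arbitrary seed, only for repeatable output
speedster = np.concatenate([rng.normal(50000, 10000, 80),
                            rng.normal(20000, 5000, 20)])
low, high = outlier_counts(speedster)
print(f"outliers on the left: {low}, on the right: {high}")
```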
Visualize all Skewness Type in One Plot
To visualize the plots next to each other, we will use side-by-side box plots with the code snippet below. If you are unaware of what side-by-side box plots are, have a look at the tutorial below.
Also Read: Side by side Boxplots in Python
import numpy as np
import matplotlib.pyplot as plt

# One dataset per skewness type: none, positive, negative
data1 = np.random.normal(50000, 10000, 100)
data2 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(150000, 50000, 20)])
data3 = np.concatenate([np.random.normal(50000, 10000, 80),
                        np.random.normal(20000, 5000, 20)])
combined_data = [data1, data2, data3]

plt.figure(figsize=(6, 4))
plt.boxplot(combined_data,
            labels=['PowerHouse(0)', 'Tech Genius(+ve)', 'Speedster(-ve)'],
            vert=False)
plt.title('SuperHeroes Strength Levels')
plt.xlabel('Power Levels')
plt.ylabel('Superhero Category')
plt.show()
The final plot looks like this:
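Since the data is random, every run draws a slightly different plot. If you want the same picture each time, one option is to create the data from a seeded generator. The snippet below is only a sketch: the function name make_datasets and the seed value 42 are assumptions, not part of the original code.

```python
import numpy as np

def make_datasets(seed=42):
    """Build the three superhero datasets from a seeded random generator."""
    rng = np.random.default_rng(seed)
    return {
        'PowerHouse': rng.normal(50000, 10000, 100),
        'Tech Genius': np.concatenate([rng.normal(50000, 10000, 80),
                                       rng.normal(150000, 50000, 20)]),
        'Speedster': np.concatenate([rng.normal(50000, 10000, 80),
                                     rng.normal(20000, 5000, 20)]),
    }

datasets = make_datasets()
for name, values in datasets.items():
    # A positive gap hints at a right tail, a negative gap at a left tail.
    gap = values.mean() - np.median(values)
    print(f"{name}: mean - median = {gap:.0f}")
```

You can then pass list(datasets.values()) to plt.boxplot in place of combined_data.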