Correlation calculation between variables in Python
Hi guys, in this article we will look at the steps to calculate the correlation between variables in Python. In simple terms, correlation is a statistical measure of the relationship between two random variables.
Refer to the following article for more details on correlation: Correlation in Python
Below are some common correlation measures defined in statistics.
- Pearson’s correlation
- Spearman’s correlation
- Kendall’s correlation
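To give a feel for how the three measures differ, here is a minimal sketch using scipy.stats (an assumption: SciPy is not used elsewhere in this article). The toy data is made up: y always increases with x, but not along a straight line, so the rank-based measures (Spearman and Kendall) report a perfect relationship while Pearson, which measures linear association, does not.

```python
from scipy import stats

# Hypothetical toy data: y grows monotonically with x, but not linearly.
x = list(range(1, 11))
y = [v ** 3 for v in x]

pearson_r, _ = stats.pearsonr(x, y)      # strength of the linear relationship
spearman_rho, _ = stats.spearmanr(x, y)  # Pearson correlation computed on ranks
kendall_tau, _ = stats.kendalltau(x, y)  # based on concordant vs. discordant pairs

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
print(f"Kendall:  {kendall_tau:.3f}")
```

Each function returns the coefficient together with a p-value, which is ignored here.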
Calculating Correlation in Python
We can measure the correlation between two or more variables using the Pingouin module. The first step is to install the package with the following command:
pip install --upgrade pingouin
Once you have installed the package, import it in your program:
import pingouin as pi
Now let's take a random dataset that contains the outcome of personality tests of 200 individuals, along with their age, height, weight and IQ. (If you want, I can give you the code to generate the random dataset.)
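One possible way to generate a comparable dataset is sketched below. This is not the code used for the article: the seed, the distributions and every numeric parameter are assumptions chosen for illustration. Only the column names (Age, IQ, Height, Weight), the 200 rows and the file name myDataset.csv come from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the data is reproducible
n = 200

# All distribution parameters below are made up for illustration.
age = rng.integers(18, 65, size=n)
iq = rng.normal(100, 15, size=n).round()
height = rng.normal(170, 10, size=n)  # centimetres
# Weight in kilograms, loosely tied to height plus random noise.
weight = 65 + 0.5 * (height - 170) + rng.normal(0, 9, size=n)

df = pd.DataFrame({
    "Age": age,
    "IQ": iq,
    "Height": height.round(1),
    "Weight": weight.round(1),
})

# Write the file that the examples in this article read back in.
df.to_csv("myDataset.csv", index=False)
print(df.shape)
```

Because weight is built from height plus noise, those two columns end up moderately correlated, while Age and IQ are independent of everything else.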
We have calculated the correlation between the height and weight of the individuals using the pingouin.corr function.
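import pingouin as pi
import pandas

# Load the dataset and report its size
df = pandas.read_csv('myDataset.csv')
print('%i subjects and %i columns' % df.shape)
df.head()  # preview the first five rows

# Correlation between height and weight
pi.corr(x=df['Height'], y=df['Weight'])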
The output of the above code will be:
200 subjects and 4 columns
In the table returned by pingouin.corr, the r column is the correlation coefficient.
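To make r more concrete, the following sketch computes Pearson's coefficient (the default method of pingouin.corr) directly from its definition and checks it against NumPy. The height and weight values are hypothetical and are not taken from the article's dataset.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) values, for illustration only.
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight = np.array([55.0, 62.0, 60.0, 72.0, 70.0, 80.0])

# Pearson's r: the sum of products of the deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = height - height.mean()
dy = weight - weight.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's correlation matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(height, weight)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))
```

The coefficient always lies between -1 and 1: values near 1 mean the variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.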
This method is a little confusing. There is an easier one (Pingouin itself is built on top of pandas). We simply create the dataframe (df) and call df.corr(method=...), where method accepts one of three values: 'pearson', 'kendall' or 'spearman'. For instance, look below for the implementation.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

df = pd.read_csv('myDataset.csv')
df.head()

# Pearson correlation matrix
pearson_correlation = df.corr(method='pearson')
print(pearson_correlation)

# Heatmap of the Pearson correlation matrix
sb.heatmap(pearson_correlation,
           xticklabels=pearson_correlation.columns,
           yticklabels=pearson_correlation.columns,
           cmap="YlGnBu",
           annot=True,
           linewidths=0.5)
plt.show()

# Spearman correlation matrix
spearman_correlation = df.corr(method='spearman')
print(spearman_correlation)

# Kendall correlation matrix
kendall_correlation = df.corr(method='kendall')
print(kendall_correlation)
Pearson correlation:
             Age        IQ    Height    Weight
Age     1.000000 -0.091642 -0.037185  0.062123
IQ     -0.091642  1.000000 -0.027006 -0.008442
Height -0.037185 -0.027006  1.000000  0.484540
Weight  0.062123 -0.008442  0.484540  1.000000

Spearman correlation:
             Age        IQ    Height    Weight
Age     1.000000 -0.061948 -0.018034  0.038593
IQ     -0.061948  1.000000 -0.029939  0.015395
Height -0.018034 -0.029939  1.000000  0.457071
Weight  0.038593  0.015395  0.457071  1.000000

Kendall correlation:
             Age        IQ    Height    Weight
Age     1.000000 -0.041663 -0.009941  0.029109
IQ     -0.041663  1.000000 -0.017685  0.011402
Height -0.009941 -0.017685  1.000000  0.315211
Weight  0.029109  0.011402  0.315211  1.000000
Here I have used the seaborn and matplotlib modules to draw a heatmap, because the printed output gets a little messy to study directly. I have drawn the heatmap only for the Pearson correlation.
As you can see, the diagonal values are all 1, because every variable is perfectly correlated with itself. To find the correlation between two different variables, look up the cell where the row of one variable meets the column of the other. For example, in the Pearson matrix the Height row and the Weight column meet at 0.484540, a moderate positive correlation.
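The same lookup can be done in code rather than by eye. The sketch below uses a small hypothetical dataframe (the values are made up; only the column names mirror the article's dataset) and reads individual cells of the correlation matrix with .loc.

```python
import pandas as pd

# Hypothetical mini dataset; column names mirror the article's dataset.
df = pd.DataFrame({
    "Age":    [23, 35, 41, 29, 52, 38],
    "Height": [160.0, 165.0, 170.0, 175.0, 180.0, 185.0],
    "Weight": [55.0, 62.0, 60.0, 72.0, 70.0, 80.0],
})

corr_matrix = df.corr(method="pearson")

# Row label first, column label second. The matrix is symmetric,
# so swapping the two labels gives the same value.
height_weight = corr_matrix.loc["Height", "Weight"]
print(f"Height vs Weight: {height_weight:.3f}")

# Each variable is perfectly correlated with itself, hence 1 on the diagonal.
print(corr_matrix.loc["Height", "Height"])
```

If only one pair is of interest, df["Height"].corr(df["Weight"]) returns the same number without building the full matrix.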