Correlation calculation between variables in Python
Hi guys, in this article we will look at the steps to calculate the correlation between variables in Python. In simple terms, correlation is a statistical measure of how strongly two random variables are related.
Refer to the following article for more details on correlation: Correlation in Python
Below are some common correlation coefficients defined in statistics:
- Pearson’s correlation
- Spearman’s correlation
- Kendall’s correlation
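To make the difference between the three coefficients concrete, here is a minimal pure-Python sketch. The function names and the sample numbers are made up for illustration, and the Kendall version is the simple tau-a variant, which ignores ties (libraries such as pandas and SciPy use tie-corrected variants by default, so results can differ when the data contain ties).

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks instead of the raw values."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up sample: weight rises with height, but not perfectly linearly.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 74, 95]

print("Pearson: ", pearson(heights, weights))
print("Spearman:", spearman(heights, weights))
print("Kendall: ", kendall_tau_a(heights, weights))
```

Because the sample weights always increase with height, the two rank-based coefficients come out as exactly 1, while Pearson's r is slightly below 1 since the relationship is not a perfect straight line.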
Calculating Correlation in Python
We can measure the correlation between two or more variables using the Pingouin module. The first step is to install the package with the following command:
pip install --upgrade pingouin
Once you have installed the package, import it in your program:
import pingouin as pi
Now let's take a random dataset that contains the personality test results of 200 individuals, along with their age, height, weight and IQ.
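A dataset of this shape can be simulated with the standard library alone. The sketch below is only one possible way to do it: the column names, the trait names, the value ranges and the height-to-weight formula are all assumptions chosen for illustration, not the data used in this article.

```python
import csv
import random

# Hypothetical column names; the personality traits are placeholders.
TRAITS = ["Openness", "Conscientiousness", "Extraversion"]
COLUMNS = ["Age", "IQ", "Height", "Weight"] + TRAITS


def make_rows(n=200, seed=42):
    """Return n simulated individuals as a list of dicts."""
    rng = random.Random(seed)  # fixed seed so the data are reproducible
    rows = []
    for _ in range(n):
        height = round(rng.gauss(170, 10), 1)  # cm
        row = {
            "Age": rng.randint(18, 65),
            "IQ": round(rng.gauss(100, 15)),
            "Height": height,
            # Weight depends on height plus noise, so the two correlate positively.
            "Weight": round(0.9 * (height - 100) + rng.gauss(0, 8), 1),  # kg
        }
        for trait in TRAITS:
            row[trait] = round(rng.uniform(1, 5), 2)  # score on a 1-5 scale
        rows.append(row)
    return rows


def write_csv(rows, path="myDataset.csv"):
    """Write the simulated rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv(make_rows())
```

Running the script writes myDataset.csv to the current directory, which is the file name the code in this article reads.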
We can calculate the correlation between the height and weight of the individuals using the pingouin.corr function:
pi.corr(x=df['Height'], y=df['Weight'])
Full code
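import pingouin as pi
import pandas

# Load the dataset (myDataset.csv must be in the working directory)
df = pandas.read_csv('myDataset.csv')
print('%i people and %i columns' % df.shape)
print(df.head())

# Correlation between height and weight (Pearson by default)
print(pi.corr(x=df['Height'], y=df['Weight']))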
The output of the above code is a one-row table of statistics.
Here r is the correlation coefficient.
This method can be a little confusing. There is an easier one built into pandas (Pingouin is itself built on top of pandas). We simply create the DataFrame (df) and call df.corr(method='...'), where method accepts one of three names: 'pearson', 'kendall' or 'spearman'. For instance, look at the implementation below.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

df = pd.read_csv('myDataset.csv')
print(df.head())

# Pearson correlation matrix of all columns
pearson_correlation = df.corr(method='pearson')
print(pearson_correlation)

# Draw the Pearson matrix as an annotated heatmap
sb.heatmap(pearson_correlation,
           xticklabels=pearson_correlation.columns,
           yticklabels=pearson_correlation.columns,
           cmap="YlGnBu", annot=True, linewidths=0.5)
plt.show()

# Rank-based alternatives
spearman_correlation = df.corr(method='spearman')
print(spearman_correlation)

kendall_correlation = df.corr(method='kendall')
print(kendall_correlation)
Output:
As you can see, the diagonal values are 1, because every variable is perfectly correlated with itself. To find the correlation between two different variables, look up the cell where the row of one variable meets the column of the other.
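Looking up a single pair can also be done in code rather than by eye. The sketch below uses a tiny made-up DataFrame (the numbers are invented; only the column names mirror the ones used in this article) and reads one cell of the matrix with .loc.

```python
import pandas as pd

# Tiny made-up dataset, for illustration only.
df = pd.DataFrame({
    "Height": [150, 160, 170, 180, 190],
    "Weight": [52, 58, 69, 74, 95],
    "IQ": [110, 95, 102, 120, 99],
})

corr = df.corr(method="pearson")

# Row label first, then column label. The matrix is symmetric,
# so ("Weight", "Height") holds the same value.
r_hw = corr.loc["Height", "Weight"]
print("Height vs Weight:", round(r_hw, 3))

# The same coefficient for one pair, without building the full matrix.
r_pair = df["Height"].corr(df["Weight"], method="pearson")
print("Same value via Series.corr:", round(r_pair, 3))
```

Series.corr accepts the same method names as DataFrame.corr, so 'spearman' or 'kendall' can be substituted in either call.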