Correlation calculation between variables in Python

Hi guys! In this article, we will look at the steps to calculate the correlation between variables in Python. In simple terms, a correlation is a statistical measure of the relationship between two random variables.

Refer to the following article for more details on correlation: Correlation in Python

Below are some common correlation measures defined in statistics.

Pearson’s correlation
Spearman’s correlation
Kendall’s correlation
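
To make the difference between the three concrete, here is a minimal sketch that computes all of them on the same data with scipy.stats. The data is synthetic and made up for illustration (it is not the data set used later in this article), and SciPy is an extra dependency assumed here.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Pearson measures linear association between the raw values.
pearson_r, pearson_p = stats.pearsonr(x, y)
# Spearman is Pearson's correlation applied to the ranks (monotonic association).
spearman_r, spearman_p = stats.spearmanr(x, y)
# Kendall's tau compares concordant and discordant pairs of observations.
kendall_tau, kendall_p = stats.kendalltau(x, y)

print(f"Pearson  r   = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_r:.3f}")
print(f"Kendall  tau = {kendall_tau:.3f}")
```

All three coefficients lie between -1 and 1; which one fits best depends on whether the relationship is expected to be linear or only monotonic.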

Calculating Correlation in Python

We can measure the correlation between two or more variables using the Pingouin module. The first step is to install the package with pip:

pip install --upgrade pingouin

Once the package is installed, import it in your program:

import pingouin as pi

Now let's take a random data set that contains the results of personality tests for 200 individuals, along with their age, height, weight and IQ. (If you want, I can give you the code used to generate the random data set.)
We calculate the correlation between the height and weight of the individuals using the pingouin.corr function.

pi.corr(x=df['Height'], y=df['Weight'])
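
If you don't have such a file at hand, a comparable data set can be simulated. The following is a minimal sketch, not the code used for this article: the column names match the ones used here, but the means, spreads and the height-weight relationship are assumed values chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # number of individuals, as in the article

# Assumed distributions (illustrative only): height in cm, weight in kg.
height = rng.normal(170, 10, n)
# Weight is tied to height so that the two columns are positively correlated.
weight = 0.5 * (height - 170) + rng.normal(70, 10, n)

df = pd.DataFrame({
    "Age": rng.integers(18, 66, n),        # ages 18 to 65
    "IQ": rng.normal(100, 15, n).round(),
    "Height": height.round(1),
    "Weight": weight.round(1),
})

# Write the file name that the rest of the article reads.
df.to_csv("myDataset.csv", index=False)
print(df.shape)
```

Because the values are random, the correlations you get from this file will not match the numbers shown in this article exactly.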

Full code

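import pingouin as pi
import pandas

# Load the data set
df = pandas.read_csv('myDataset.csv')
# df.shape is (rows, columns); %i formats both as integers
print('%i subjects and %i columns' % df.shape)
df.head()

# Pearson correlation between height and weight
pi.corr(x=df['Height'], y=df['Weight'])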

The output of the above code will be

200 subjects and 4 columns

           n      r         CI95%     r2  adj_r2         p-val       BF10  power
pearson  200  0.485  [0.37, 0.58]  0.235   0.227  3.595866e-13  2.179e+10    1.0

Here r is the correlation coefficient.
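
To see what r actually measures, here is a minimal sketch that computes Pearson's r directly from its definition (the covariance of the two variables divided by the product of their standard deviations) using NumPy. The helper name pearson_r and the five height/weight pairs are made up for illustration. Squaring r gives the r2 column of the table above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from its definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    # Sum of cross-products divided by the product of the deviation norms.
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up sample values, for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 88]

r = pearson_r(heights, weights)
print("r  =", round(r, 3))
print("r2 =", round(r ** 2, 3))
```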
The Pingouin approach is a little confusing. There is an easier method (Pingouin is itself built on pandas): simply create the DataFrame (df) and call df.corr(method=...), where the method argument accepts one of three values: 'pearson', 'kendall' or 'spearman'. This returns the correlation for every pair of numeric columns at once. For instance, look at the implementation below.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

df = pd.read_csv('myDataset.csv')
df.head()

# Pearson correlation matrix for all pairs of columns
pearson_correlation = df.corr(method='pearson')
print(pearson_correlation)

# Draw the Pearson matrix as an annotated heatmap
sb.heatmap(pearson_correlation,
           xticklabels=pearson_correlation.columns,
           yticklabels=pearson_correlation.columns,
           cmap="YlGnBu",
           annot=True,
           linewidths=0.5)
plt.show()

# Spearman and Kendall correlation matrices
spearman_correlation = df.corr(method='spearman')
print(spearman_correlation)
kendall_correlation = df.corr(method='kendall')
print(kendall_correlation)

Output (the Pearson, Spearman and Kendall matrices, in that order):

             Age        IQ    Height    Weight
Age     1.000000 -0.091642 -0.037185  0.062123
IQ     -0.091642  1.000000 -0.027006 -0.008442
Height -0.037185 -0.027006  1.000000  0.484540
Weight  0.062123 -0.008442  0.484540  1.000000
             Age        IQ    Height    Weight
Age     1.000000 -0.061948 -0.018034  0.038593
IQ     -0.061948  1.000000 -0.029939  0.015395
Height -0.018034 -0.029939  1.000000  0.457071
Weight  0.038593  0.015395  0.457071  1.000000
             Age        IQ    Height    Weight
Age     1.000000 -0.041663 -0.009941  0.029109
IQ     -0.041663  1.000000 -0.017685  0.011402
Height -0.009941 -0.017685  1.000000  0.315211
Weight  0.029109  0.011402  0.315211  1.000000

[Figure: heatmap of the Pearson correlation matrix]

Here I have used the seaborn and matplotlib modules to draw the picture above, since the printed output gets a little messy to read directly. The heatmap is drawn only for the Pearson correlation.

As you can see, the diagonal values are all 1, because every variable is perfectly correlated with itself. To find the correlation between two different variables, look up the cell where the row of one variable meets the column of the other. For example, the Height row and the Weight column meet at 0.484540 in the Pearson matrix.
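
The same lookup can be done in code. The sketch below rebuilds the Pearson matrix from the values printed in the output above, reads a single pair with .loc, and then finds the strongest pair while ignoring the diagonal. The variable names are my own choice for illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the Pearson matrix in the output above.
labels = ["Age", "IQ", "Height", "Weight"]
corr = pd.DataFrame(
    [[1.000000, -0.091642, -0.037185, 0.062123],
     [-0.091642, 1.000000, -0.027006, -0.008442],
     [-0.037185, -0.027006, 1.000000, 0.484540],
     [0.062123, -0.008442, 0.484540, 1.000000]],
    index=labels, columns=labels)

# Row name first, column name second.
print(corr.loc["Height", "Weight"])

# Keep only the cells above the diagonal, so each pair appears once
# and the trivial self-correlations of 1 are excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strongest = upper.abs().stack().idxmax()
print(strongest)
```

Since the matrix is symmetric, corr.loc["Weight", "Height"] gives the same value.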
