Pearson Correlation Test between two variables in Python
It is often useful to look for relationships between the variables in a dataset before applying Machine Learning algorithms to it.
The correlation coefficient summarizes the relationship between two variables as a single number.
This tutorial covers the following:
- What is Correlation?
- Pearson’s Correlation
- Implementation in Python
What is Correlation?
Correlation helps answer questions such as:
- How strongly do the values of one variable depend on the values of another variable?
- How loosely or tightly is one variable associated with another?
- As a real-world example, does the salary of an employee depend on the employee’s work experience?
Correlation refers to the statistical relationship between two variables. It describes association only and does not by itself show that one variable causes the other.
The value of the correlation coefficient can be positive, negative, or zero.
- Positive correlation: As the value of one variable increases, the value of the other tends to increase too. (They move in the same direction.)
- Negative correlation: As the value of one variable increases, the value of the other tends to decrease. (They move in opposite directions.)
- Neutral correlation: There is no relationship between the changes in the two variables.
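The three cases above can be illustrated with synthetic data. This is a minimal sketch: the arrays, the seed, and the slope of 2 are made up for the illustration and do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
x = rng.normal(size=500)
noise = rng.normal(size=500)

y_pos = 2 * x + noise           # tends to rise when x rises
y_neg = -2 * x + noise          # tends to fall when x rises
y_none = rng.normal(size=500)   # generated independently of x

for name, values in [("positive", y_pos), ("negative", y_neg), ("neutral", y_none)]:
    r = np.corrcoef(x, values)[0, 1]  # off-diagonal entry of the 2x2 matrix
    print(f"{name}: r = {r:.2f}")
```

With these settings the first coefficient should come out clearly positive, the second clearly negative, and the third close to zero.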
The performance of some algorithms can degrade when the independent variables are strongly related to each other (positively or negatively), a situation called multicollinearity. For example, in linear regression, it often helps to discard one variable from each pair of highly correlated variables.
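One possible way to spot such pairs is to scan the pairwise correlation matrix of the input variables. The sketch below assumes hypothetical features (experience, age, hours) generated at random, and the 0.9 cut-off is an arbitrary choice for the illustration, not a standard rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 200

# Hypothetical features, invented for this illustration
experience = rng.uniform(0, 20, size=n)
age = 22 + experience + rng.normal(scale=1.0, size=n)  # almost a shifted copy of experience
hours = rng.uniform(30, 50, size=n)                    # generated independently

features = pd.DataFrame({"experience": experience, "age": age, "hours": hours})
corr = features.corr()  # pairwise Pearson correlation is the default method

threshold = 0.9  # arbitrary cut-off chosen for this sketch
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))  # upper triangle only, so each pair appears once
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Here only the experience/age pair should be flagged, which marks one of the two as a candidate for removal.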
We may also be interested in the relationship between each input variable and the output variable, in order to know which variables are relevant as inputs for developing a model.
Pearson’s Correlation
The Pearson correlation coefficient quantifies the strength and direction of the linear relationship between two variables. Its value lies between -1 and 1, inclusive. Positive and negative values have the meaning discussed earlier in this tutorial, and a value near 0 indicates little or no linear relationship.
The mathematical formula of Pearson’s correlation:
correlation = covariance(x, y) / (std(x) * std(y))
Covariance summarizes how two variables vary together. It is the average of the products of each sample's deviations from the mean of x and the mean of y. The problem with covariance as a statistical tool is that its magnitude depends on the units of the variables, which makes its value very challenging to interpret.
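To see why the covariance is hard to read on its own, the sketch below rescales one variable and compares the results. It uses the same small arrays as the NumPy section of this tutorial; the factor of 100 is an arbitrary stand-in for a change of units.

```python
import numpy as np

x = np.arange(25, 35)
y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])

cov_xy = np.cov(x, y)[0, 1]            # sample covariance of x and y
cov_scaled = np.cov(x * 100, y)[0, 1]  # same data, x expressed in units 100 times smaller

corr_xy = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x * 100, y)[0, 1]

print(cov_xy, cov_scaled)    # the covariance grows by the scaling factor
print(corr_xy, corr_scaled)  # the correlation is unaffected by the rescaling
```

The covariance changes by a factor of 100 even though the relationship between the variables is the same, while the correlation stays put.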
Coming back to Pearson’s correlation, it is given as the covariance between x and y divided by the product of their respective standard deviations.
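The formula can be checked by hand against NumPy. In this sketch the covariance and both standard deviations use the n - 1 divisor; any divisor works as long as it is the same in all three, because it cancels in the ratio. The variable names are illustrative.

```python
import numpy as np

x = np.arange(25, 35)
y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])

# Sample covariance: mean-centred products, summed and divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# ddof=1 makes np.std use the same n - 1 divisor as the covariance above
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Note that np.std defaults to ddof=0 while np.cov defaults to the n - 1 divisor, so mixing the two defaults would give a slightly wrong coefficient.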
Implementation in Python
Pearson’s correlation with NumPy.
Here we create two NumPy arrays, x and y, of 10 integers each. Once we have two arrays of the same length, we can use np.corrcoef() to get the correlation matrix.
    import numpy as np

    x = np.arange(25, 35)
    y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])
    np.corrcoef(x, y)

    array([[1.        , 0.83801964],
           [0.83801964, 1.        ]])
The upper-left and lower-right values (the diagonal) are 1. The upper-left value is the correlation of x with itself and the lower-right value is the correlation of y with itself, so both are always 1.
What we need here is the upper-right or the lower-left value, each of which is the Pearson correlation between x and y.
In this case, it is approximately 0.84, which indicates a strong positive linear correlation between x and y.
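If only the single coefficient is needed, it can be picked out of the matrix by index. A minimal sketch, with corr_matrix and r as illustrative names:

```python
import numpy as np

x = np.arange(25, 35)
y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])

corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]  # upper-right entry; corr_matrix[1, 0] holds the same value
print(f"{r:.3f}")      # prints 0.838
```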
Let’s plot to see the relationship more clearly.
    from matplotlib import pyplot

    pyplot.scatter(x, y)
    pyplot.show()
As the figure shows, there is a strong positive correlation between x and y.
Pearson’s correlation can also be calculated with the SciPy and Pandas libraries, using the pearsonr() and corr() functions respectively.
Let us see the implementation of the same.
Pearson Correlation with SciPy.
    import numpy as np
    from scipy import stats  # the older scipy.stats.stats namespace is deprecated

    x = np.arange(25, 35)
    y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])
    stats.pearsonr(x, y)
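In addition to the correlation value, this function also returns the p-value (0.00246).
The p-value is used in hypothesis testing. Here it is the probability of observing a correlation at least this strong if the two variables were actually uncorrelated, so a small p-value suggests that the correlation is unlikely to be due to chance alone. It is a very important measure, and interpreting it properly requires some background in statistics and probability.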
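The two returned values can be unpacked and compared against a significance level. The sketch below assumes the conventional level of 0.05, which is a choice made for this example rather than anything pearsonr() prescribes.

```python
import numpy as np
from scipy import stats

x = np.arange(25, 35)
y = np.array([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])

r, p_value = stats.pearsonr(x, y)  # the result unpacks into (coefficient, p-value)
alpha = 0.05  # significance level assumed for this example

print(f"r = {r:.3f}, p-value = {p_value:.5f}")
if p_value < alpha:
    print("Reject the hypothesis of no linear correlation at the 5% level")
else:
    print("Not enough evidence against zero correlation at the 5% level")
```

For these arrays the p-value is well below 0.05, so the first branch is taken.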
Pearson Correlation with Pandas.
    import pandas as pd

    x = pd.Series(range(25, 35))
    y = pd.Series([10, 14, 17, 23, 25, 29, 32, 36, 70, 39])
    print(x.corr(y), y.corr(x))
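When the data sits in a DataFrame, corr() can also be called on the whole frame to get every pairwise coefficient at once. In this sketch the column z is a hypothetical extra variable added only to show a perfect negative correlation.

```python
import pandas as pd

df = pd.DataFrame({
    "x": list(range(25, 35)),
    "y": [10, 14, 17, 23, 25, 29, 32, 36, 70, 39],
    "z": list(range(10, 0, -1)),  # hypothetical column that falls steadily as x rises
})

corr_matrix = df.corr(method="pearson")  # "pearson" is also the default method
print(corr_matrix.round(3))

r_xy = corr_matrix.loc["x", "y"]  # look up a single pair by column name
r_xz = corr_matrix.loc["x", "z"]
```

Looking values up by column name avoids having to remember which position in the matrix belongs to which variable.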