Bivariate Analysis in Python

In data science, the first step after obtaining a dataset is usually exploratory data analysis (EDA). In this tutorial, we will explore one part of EDA: bivariate analysis.

Bivariate Analysis

As the name suggests, bivariate means two variables, so bivariate analysis is performed on two variables at once. What do we aim for from this analysis? The goal is to determine the relationship between the two variables. A variable can be one of two types: continuous or categorical.
A continuous variable is quantitative, whereas a categorical variable is qualitative.

Continuous and Continuous Variables

For this analysis, we use a scatter plot for visualization and calculate the correlation coefficient to quantify the relationship between the two variables. I am using the iris dataset for bivariate analysis. Since both variables must be quantitative, I have selected the sepal length and sepal width columns.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)


# Scatter plot of sepal length vs. sepal width (both variables are continuous)
plt.figure(figsize=(10, 6))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', hue='species', style='species', data=df)
plt.title('Scatter Plot of Sepal Length vs. Sepal Width by Species')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()


# Calculate and display the correlation coefficient between sepal length and sepal width
correlation_matrix = df[['sepal length (cm)', 'sepal width (cm)']].corr()
print("Correlation coefficient between Sepal Length and Sepal Width:\n", correlation_matrix)

Output

[Figure: Scatter plot of sepal length vs. sepal width by species]

Correlation coefficient between Sepal Length and Sepal Width:  
                    sepal length (cm)     sepal width (cm) 
sepal length (cm)   1.00000               -0.11757 
sepal width (cm)    -0.11757              1.00000

Explanation

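The correlation matrix holds the correlation coefficient for each pair of columns. The diagonal values are 1 because every variable is perfectly correlated with itself. The off-diagonal value of about -0.118 indicates a weak negative linear relationship between sepal length and sepal width across the whole dataset. If you don't know about the correlation matrix, refer to this article: Correlation matrix.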
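To make the coefficient less of a black box, here is a minimal sketch of the Pearson formula, which is the default method used by `DataFrame.corr()`. The function name `pearson_r` and the toy lists are my own for illustration and are not taken from the iris dataset. The sketch assumes neither list is constant, since a constant list would cause a division by zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, chosen only to show the two extremes
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises in a straight line with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls in a straight line as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

A value near +1 or -1 means the points lie close to a straight line, while a value near 0, like the one for the sepal columns, means there is little linear relationship.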

Continuous and Categorical Variables

For this analysis, we use box plots or violin plots for visualization. To test the relationship statistically, we use ANOVA (Analysis of Variance), which tests whether the mean of the continuous variable differs across the categories. I am again using the iris dataset, with the petal length column as the continuous variable and the species column as the categorical variable.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Box Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='species', y='petal length (cm)', data=df)
plt.title('Petal Length Distribution by Species')
plt.show()

# ANOVA
f_value, p_value = stats.f_oneway(df[df['species'] == 'setosa']['petal length (cm)'],
                                  df[df['species'] == 'versicolor']['petal length (cm)'],
                                  df[df['species'] == 'virginica']['petal length (cm)'])
print(f"ANOVA test results - F value: {f_value}, P value: {p_value}")

Output

[Figure: Box plot of petal length distribution by species]

ANOVA test results - F value: 1180.161182252981, P value: 2.8567766109615584e-91

Explanation

ANOVA tests the null hypothesis that all groups have the same mean against the alternative hypothesis that at least one group mean differs.
The F value is the ratio of the variance between the groups to the variance within the groups. A large F value means the group means differ by more than the spread inside each group would explain, which suggests that the categorical variable (species) has an effect on the continuous variable (petal length).
The P value is the probability of obtaining an F value at least as large as the observed one if the null hypothesis were true. It is not the probability that the null hypothesis itself is true. A P value below the chosen significance level (commonly 0.05) leads us to reject the null hypothesis. Here, the P value is about 2.86e-91, so the mean petal length clearly differs between species.
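The ratio behind the F value can be worked out by hand. The following is a minimal sketch under my own assumptions: the helper name `one_way_f` and the three toy groups are hypothetical and serve only to show the arithmetic. It does not replace `stats.f_oneway`, which also returns the P value.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numbers."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)    # mean square, k - 1 degrees of freedom
    ms_within = ss_within / (n - k)      # mean square, n - k degrees of freedom
    return ms_between / ms_within

# Hypothetical toy data: three small groups with means 2, 3 and 6
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(toy_groups)
print(f"F = {f_stat:.2f}")  # F = 13.00
```

For the toy groups, the between-group mean square is 13 and the within-group mean square is 1, so F is 13. When all group means are equal, the numerator is 0 and so is F.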

Categorical and Categorical Variables

For this analysis, we use a heat map for visualization, built on a contingency table. To test the relationship statistically, we use the Chi-squared test. A contingency table contains the number of cases for each combination of categories of the two variables.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency


data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Preference': ['Tea', 'Coffee', 'Tea', 'Tea', 'Coffee', 'Coffee']}
df = pd.DataFrame(data)


contingency_table = pd.crosstab(df['Gender'], df['Preference'])

# Chi-squared test (for a 2x2 table, SciPy applies Yates' continuity correction by default)
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-squared test results - Chi2 value: {chi2}, P value: {p}")

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(contingency_table, annot=True, cmap="YlGnBu", fmt="d")
plt.title('Heatmap of Gender vs. Preference')
plt.ylabel('Gender')
plt.xlabel('Preference')
plt.show()

Output

[Figure: Heatmap of gender vs. preference]

Chi-squared test results - Chi2 value: 0.0, P value: 1.0

Explanation

The Chi-squared test is used to check for a relationship between two categorical variables. It examines whether the observed distribution of cases across the categories deviates from what we would expect if the variables were independent, and whether a deviation of that size is likely to occur by chance.
The Χ2 value measures the difference between the observed frequencies in the contingency table and the expected frequencies (the frequencies we would see if there were no relation between the two categorical variables). A higher value indicates a larger departure from independence.
The P value is the probability of obtaining a Χ2 value at least as large as the observed one if the null hypothesis (no relation between the variables) were true. It is not the probability that the null hypothesis is true. A P value below the chosen significance level (commonly 0.05) leads us to reject the null hypothesis, which indicates that a relation between the variables exists and the difference is unlikely to be due to chance. In this example, the Χ2 value is 0.0 and the P value is 1.0, so there is no evidence of a relation between gender and preference. With only six observations, the test would have little power to detect one in any case.
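The Χ2 value of exactly 0.0 in the output may look surprising, because the observed counts are not identical to the expected ones. The sketch below recomputes the statistic by hand for the same table. The helper name `chi_squared` and its `yates` flag are my own for illustration and are not part of SciPy. It assumes the crosstab layout of rows Female, Male and columns Coffee, Tea.

```python
def chi_squared(observed, yates=False):
    """Chi-squared statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent
            exp = row_totals[i] * col_totals[j] / total
            diff = abs(obs - exp)
            if yates:
                # Continuity correction: shrink each deviation by up to 0.5
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp
    return stat

# Counts from the Gender/Preference example above
table = [[2, 1],   # Female: Coffee, Tea
         [1, 2]]   # Male:   Coffee, Tea

chi2_plain = chi_squared(table)
chi2_yates = chi_squared(table, yates=True)
print(chi2_plain, chi2_yates)
```

Every expected count here is 1.5 and every observed count is 0.5 away from it, so the uncorrected statistic is about 0.667. `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, which removes exactly that 0.5 and gives 0.0. Passing `correction=False` to `chi2_contingency` would report the uncorrected value instead.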
