Elbow Method in Python to Find the Optimal Number of Clusters

In this tutorial, we will learn a method that is commonly used to determine the optimal number of clusters when clustering data in Python. As usual, we will cover the theory first and then the application in Python.

Elbow Method

The elbow method is generally used to find the optimal number of clusters for a given dataset. In this method, the explained variance is plotted as a function of the number of clusters, and we look for the point at which the rate of improvement drops off sharply. This point is called the elbow of the plot, because the curve looks like a bent human arm, and it gives the optimal cluster count.

Explained Variance

You have studied variance in statistics: it measures how much the values in a dataset deviate from the mean of the dataset. Its formula is:

Variance (σ²) = (1/N) Σᵢ (xᵢ − μ)², where μ is the mean of the dataset and N is the number of values.
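As a quick illustration, here is a small sketch using NumPy (the array values below are made up purely for the example):

import numpy as np

# A small made-up sample
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: average of squared deviations from the mean
mean = values.mean()
variance = np.mean((values - mean) ** 2)

print(variance)         # 4.0, computed by hand from the formula
print(np.var(values))   # same result with NumPy's built-in variance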

In clustering, however, we do not work with this variance directly, which is why the terminology differs. Explained variance is related to the concept of inertia. In the context of clustering, inertia is defined mathematically as the sum of squared distances between each data point and the closest cluster centroid. The formula is analogous to the moment-of-inertia formula in physics, which is where the term comes from.

Inertia = Σᵢ ||xᵢ − c(xᵢ)||², where c(xᵢ) is the centroid closest to the data point xᵢ.
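To make the definition concrete, here is a small sketch that computes the inertia by hand for a fitted KMeans model and checks it against scikit-learn's inertia_ attribute (the 2D points below are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points for illustration
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia by hand: squared distance from each point to its assigned centroid
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
manual_inertia = sum(np.sum((x - centroids[label]) ** 2)
                     for x, label in zip(X, labels))

print(manual_inertia)    # matches kmeans.inertia_ (up to floating point)
print(kmeans.inertia_)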

This inertia physically characterizes the clusters: lower inertia means the points sit close to their centroids, i.e. the clusters are compact. Now, let's redefine the elbow point in terms of inertia. It is the point beyond which increasing the number of clusters no longer decreases the inertia to any great extent; the inertia keeps decreasing, but only marginally. In other words, adding more clusters brings little further improvement in how well the data is partitioned.
Please note that we are not minimizing the inertia. We are finding the optimal point, not the minimum point.
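A quick way to see why minimizing the inertia is not the goal is shown in the sketch below (the 2D points are made up for illustration): inertia keeps falling as the cluster count grows and reaches zero once every point is its own cluster, so the minimum is meaningless and the elbow is what matters.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points for illustration
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.1, 4.9], [9.0, 0.5], [9.2, 0.4]])

for k in (1, 2, 3, 6):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k}: inertia = {inertia:.3f}")

# Inertia keeps decreasing as k grows and is exactly 0 at k = 6
# (one cluster per point), so we look for the elbow, not the minimum.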

Python Code: Elbow Method

Let's apply what we have learned. I am using the Iris dataset from sklearn. As we know, it has 3 flower classes, so the optimal number of clusters should be 3. Let's verify this with code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X = data.data

# Standardizing as the data might be in different scales or ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Calculating SSD (Sum of squared distances) for a range of number of clusters
ssd = []
range_n_clusters = list(range(1, 11))
for n_clusters in range_n_clusters:
    # n_init=10 keeps the run deterministic across scikit-learn versions
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_scaled)
    ssd.append(kmeans.inertia_)

# Plotting the SSD for each number of clusters
plt.figure(figsize=(10, 6))
plt.plot(range_n_clusters, ssd, marker='o')
plt.title('Elbow Method for Determining Optimal Number of Clusters on Iris Dataset')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squared Distances (SSD)')
plt.xticks(range_n_clusters)
plt.grid(True)
plt.show()

Output

[Elbow plot: Sum of Squared Distances (SSD) versus number of clusters for the Iris dataset]

Observe the plot and check whether the elbow point is at n = 3 or not.
After n = 3, the inertia decreases only marginally, so the optimal number of clusters is 3.
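If you prefer a numeric check over eyeballing the plot, one rough approach (just a sketch, reusing the ssd and range_n_clusters variables computed above) is to look at the percentage drop in inertia between successive cluster counts and see where the drops become small:

# Percentage decrease in SSD when moving from k to k + 1 clusters
drops = [(ssd[i] - ssd[i + 1]) / ssd[i] * 100 for i in range(len(ssd) - 1)]
for k, drop in zip(range_n_clusters[:-1], drops):
    print(f"{k} -> {k + 1} clusters: SSD drops by {drop:.1f}%")

# The drops shrink noticeably after k = 3, matching the elbow in the plot.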

Code Explanation

First, we load the Iris dataset and then standardize the values, since the features may be on different scales. After that, we compute the inertia for each cluster count using the kmeans.inertia_ attribute and append it to a list. We limit the range of cluster counts to 1 through 10 to keep the computation small. Finally, we plot the number of clusters against the inertia values.
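Once the elbow suggests 3 clusters, a natural next step is to fit the final model with that count and inspect the cluster sizes. Here is a short sketch that continues from the variables defined above:

# Fit the final model with the chosen number of clusters
final_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Count how many points fall into each cluster
unique, counts = np.unique(final_kmeans.labels_, return_counts=True)
print(dict(zip(unique, counts)))

# The three clusters roughly correspond to the three iris species,
# even though k-means is unsupervised and never sees the labels.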
