Elbow Method in Python to Find Optimal Clusters
In this tutorial, we will learn an exciting method used to find the optimal number of clusters for a clustering algorithm in Python. As usual, we will start with the theory and then move on to the application in Python.
Elbow Method
The elbow method is generally used to find the optimal number of clusters for a given dataset. In this method, the explained variance is plotted as a function of the number of clusters, and we look for the point at which the rate of decrease slows sharply. This point is called the elbow of the plot, as the graph resembles a bent human arm, and it gives the optimal cluster count.
Explained Variance
You have studied variance in statistics. It's basically how much the values in a dataset vary from the mean of the dataset. For values x1, ..., xN with mean mu, the (population) variance is the average of the squared deviations from the mean: sigma^2 = (1/N) * sum((xi - mu)^2).
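As a quick refresher, the variance formula above can be checked directly with NumPy (a small sketch with made-up numbers, separate from the clustering code):

```python
import numpy as np

# A small made-up dataset for illustration
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance: mean of squared deviations from the mean
mean = data.mean()
variance = np.mean((data - mean) ** 2)

print(variance)      # 4.0
print(np.var(data))  # 4.0 -- NumPy's built-in gives the same result
```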
In our clustering method, we don't use this variance directly, which is why the terminology differs. Explained variance is related to the concept of inertia. In the context of clustering, inertia is defined mathematically as the sum of squared distances between each data point and the centroid of its closest cluster. The formula is analogous to the formula for inertia in physics, and hence the term was given to it.
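To make the definition concrete, inertia can be computed by hand and compared against the value scikit-learn reports in kmeans.inertia_ (a sketch; the toy 2D points below are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D points (made up for illustration): two obvious groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia)    # matches kmeans.inertia_ (up to floating point)
print(kmeans.inertia_)
```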
This inertia characterizes the clusters: lower inertia means dense, compact clusters. Now, let's redefine the elbow point in terms of inertia. It is the point beyond which increasing the number of clusters no longer decreases the inertia to any great extent. The inertia still decreases, but only minimally, which suggests that adding more clusters brings no real improvement to the clustering.
Please note that we are not minimizing the inertia. We are finding the optimal point, not the minimum point.
Python Code: Elbow Method
Let’s apply what we have learned. I am using the Iris dataset from sklearn. As we know, it has 3 flower classes, so the optimal cluster count should be 3. Let’s verify this with code.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X = data.data

# Standardize, as the features might be on different scales or ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Calculate the SSD (sum of squared distances) for a range of cluster counts
ssd = []
range_n_clusters = list(range(1, 11))
for n_clusters in range_n_clusters:
    # n_init=10 keeps behavior consistent across scikit-learn versions
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_scaled)
    ssd.append(kmeans.inertia_)

# Plot the SSD for each number of clusters
plt.figure(figsize=(10, 6))
plt.plot(range_n_clusters, ssd, marker='o')
plt.title('Elbow Method for Determining Optimal Number of Clusters on Iris Dataset')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squared Distances (SSD)')
plt.xticks(range_n_clusters)
plt.grid(True)
plt.show()
Output
Observe the plot and assess whether the elbow point is at n = 3 or not.
After n = 3, the inertia decreases only slightly, so the optimal number of clusters is 3.
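Reading the elbow off a plot is somewhat subjective. As a rough programmatic check (a heuristic sketch, not a built-in scikit-learn feature), you can print the relative drop in inertia between consecutive cluster counts and watch where the drops shrink:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Inertia for k = 1..10, same settings as the tutorial code
ssd = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
       for k in range(1, 11)]

# Relative improvement when going from k clusters to k+1 clusters
drops = [(ssd[i] - ssd[i + 1]) / ssd[i] for i in range(len(ssd) - 1)]

for k, d in zip(range(1, 10), drops):
    print(f"k={k} -> k={k + 1}: inertia drops by {d:.1%}")
```

The early transitions show large relative drops; after k = 3 the drops become much smaller, which is the elbow in numeric form.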
Code Explanation
First, we load the Iris dataset and then standardize the values, as the features might be on different scales. After that, we read the inertia from the kmeans.inertia_ attribute and append this value to a list. We have restricted the range of cluster counts to 1 through 10 to limit the computation. Finally, we plot the inertia values against the number of clusters.