Implementation of Agglomerative Clustering with Scikit-Learn
Unsupervised machine learning algorithms search for patterns in unlabelled data. Hierarchical clustering groups the data into a hierarchy of clusters, and it can proceed top-down (divisive) or bottom-up (agglomerative). In the bottom-up approach, each data point starts as its own cluster. Then, in each iteration, the algorithm merges the two closest clusters into a single cluster. This process continues until the required number of clusters is reached. For a better theoretical understanding of how agglomerative clustering works, you can refer here.
In this article, we implement hierarchical clustering analysis using Python and the scikit-learn library.
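The bottom-up merging loop described above can be sketched in a few lines of NumPy. This is only an illustrative, unoptimized sketch and not how scikit-learn implements it: the function name `naive_agglomerative`, the toy points, and the choice of "closest pair of members" as the distance between clusters are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    # Every point starts as its own cluster (stored as lists of row indices).
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumed cluster distance: the closest pair of members.
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b (b > a, so a is unaffected).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy data: two tight pairs and one isolated point.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
result = naive_agglomerative(toy, 3)
print(result)
```

On the toy data the two tight pairs are merged first and the isolated point is left on its own once three clusters remain.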
Agglomerative clustering with Sklearn
You will need scikit-learn (sklearn), Python's library for machine learning. We will use the iris dataset, which ships with scikit-learn. It is a common dataset for beginners experimenting with machine learning techniques.
To start off, the necessary libraries are imported.
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt  # needed for the plots below
import numpy as np
import pandas as pd
We then load the dataset and split it into inputs and outputs.
iris = datasets.load_iris()
X = iris['data']   # input features
Y = iris.target    # class labels, kept only for comparison later
print(X.shape)     # (150, 4)
The data consists of four attributes and 150 records. The target variable is Y and has three categories. Since this is unsupervised learning, we will not be providing Y to the clustering algorithm. However, we will be keeping it to compare with the results of our clustering algorithm. We now perform Principal Component Analysis to reduce the features from four to two, for ease of visualization.
# Project the four features onto the first two principal components
X = PCA(n_components=2).fit_transform(X)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
The scatter plot shows the 150 records projected onto the two principal components.
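Before clustering on the reduced data, it can be worth checking how much of the original variance the two components keep. The snippet below is a small self-contained sketch; the variable names `features`, `pca` and `retained` are chosen for this example only.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

# Reload the raw four-feature data so the check is independent of earlier steps.
features = datasets.load_iris()["data"]

# Fit PCA with two components and sum the share of variance each one explains.
pca = PCA(n_components=2).fit(features)
retained = pca.explained_variance_ratio_.sum()
print(f"Share of variance retained by two components: {retained:.3f}")
```

A value close to 1 suggests that the two-dimensional plot is a reasonable stand-in for the full four-dimensional data.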
The features have been reduced to two variables and are now ready for clustering. We use the AgglomerativeClustering class that ships with sklearn for this purpose.
from sklearn.cluster import AgglomerativeClustering
# scikit-learn 1.2+ names the distance parameter `metric`; older releases call it `affinity`
classifier = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
clusters = classifier.fit_predict(X)
The parameters for the clustering classifier have to be set. The number of clusters that we want to group the data into is 3. You can try changing this parameter and see the results vary. The distance metric for determining the closeness of points is Euclidean distance. Finally, we set the type of linkage as complete linkage. This determines how the clusters will be grouped together.
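To see the effect of the number of clusters, one option is to refit the model for a few values and look at the cluster sizes. This is a minimal sketch: the values 2, 3 and 4 and the names `points` and `sizes_by_k` are arbitrary choices for illustration, and the distance metric is left at its default (Euclidean).

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA

# Same preprocessing as in the article: iris reduced to two components.
points = PCA(n_components=2).fit_transform(datasets.load_iris()["data"])

sizes_by_k = {}
for k in (2, 3, 4):
    model = AgglomerativeClustering(n_clusters=k, linkage="complete")
    labels = model.fit_predict(points)
    # np.bincount counts how many points received each cluster label.
    sizes_by_k[k] = np.bincount(labels)
    print(k, sizes_by_k[k])
```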
The two most common linkages in agglomerative clustering are single linkage and complete linkage. In single linkage, the distance between two clusters is the distance between their two closest points. In contrast, in complete linkage the distance between two clusters is the distance between their two farthest points. Try experimenting with different linkage types; the data will generally cluster differently with each one.
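The difference between the two linkage definitions is easy to check by hand on a pair of tiny clusters. The helper `linkage_distances` and the points below are made up for this illustration.

```python
import numpy as np

def linkage_distances(cluster_a, cluster_b):
    """Return (single, complete) linkage distances between two clusters."""
    # All pairwise Euclidean distances between members of the two clusters.
    pairwise = np.linalg.norm(
        cluster_a[:, None, :] - cluster_b[None, :, :], axis=-1
    )
    # Single linkage uses the closest pair, complete linkage the farthest pair.
    return pairwise.min(), pairwise.max()

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [6.0, 0.0]])
single, complete = linkage_distances(a, b)
print(single, complete)  # 2.0 6.0
```

In scikit-learn, the same choice is made through the `linkage` argument of AgglomerativeClustering, which accepts 'ward', 'complete', 'average' and 'single'.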
Following this, we visualize the results of the clustering algorithm.
# One scatter call per cluster so that each gets its own color and label
plt.scatter(X[clusters == 0, 0], X[clusters == 0, 1], label='Type 1')
plt.scatter(X[clusters == 1, 0], X[clusters == 1, 1], label='Type 2')
plt.scatter(X[clusters == 2, 0], X[clusters == 2, 1], label='Type 3')
plt.title('Clusters')
plt.legend()  # display the labels set above
plt.show()
You can compare this plot with one colored by the actual target values of the data; the two groupings turn out to be broadly similar.
plt.scatter(X[Y == 0, 0], X[Y == 0, 1], label='Type 1')
plt.scatter(X[Y == 1, 0], X[Y == 1, 1], label='Type 2')
plt.scatter(X[Y == 2, 0], X[Y == 2, 1], label='Type 3')
plt.title('Actual classes')
plt.legend()
plt.show()
Thus, we have implemented agglomerative clustering using Python and scikit-learn.
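Besides comparing the two plots by eye, the agreement between clusters and true classes can also be put into numbers. The sketch below uses a pandas cross-tabulation and scikit-learn's `adjusted_rand_score`; this check is an addition for illustration, and the names `points`, `labels`, `table` and `ari` are assumptions of the example. Cluster numbers are arbitrary, so cluster 0 need not correspond to class 0.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
points = PCA(n_components=2).fit_transform(iris["data"])
labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(points)

# Rows are the true classes, columns are the clusters found by the algorithm.
table = pd.crosstab(iris.target, labels, rownames=["true class"], colnames=["cluster"])
print(table)

# The adjusted Rand index is 1.0 for a perfect match and close to 0 for random labels.
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```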