Hierarchical Clustering Analysis
Hey guys, today in another data science post we will talk about hierarchical clustering. Let's start with a simple definition of clustering. Clustering is an unsupervised technique that takes data points, on a scatter plot for instance, and groups similar points into the same cluster without any predefined labels; classification, by contrast, works the other way around and assigns each instance to a known class label.
Hierarchical Clustering Analysis (HCA)
Let us assume we have a dataset of animals, where each point represents a different animal. We start with one data point and look for the point closest to it. For example, dog and wolf come under one cluster, and tiger and cat come under another cluster, based on the properties in the dataset. The algorithm then builds a dendrogram, a hierarchy of clusters.
HCA is of two types:
- Agglomerative
- Divisive
The one we talked about here was the agglomerative technique, which is a bottom-up approach (a small code sketch follows the steps below):
1. Start with each data point as its own cluster.
2. Find the closest pair of clusters and merge them.
3. Repeat step 2 until all points fall under one supercluster.
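To make the steps concrete, here is a minimal sketch using scikit-learn's AgglomerativeClustering. The animal names and their two features (a "canine-ness" and a "feline-ness" score) are made up purely for illustration:

```python
# A minimal sketch of bottom-up (agglomerative) clustering.
# The animals and their two features are made-up illustrative values.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

animals = ["dog", "wolf", "tiger", "cat"]
X = np.array([
    [0.9, 0.1],   # dog
    [0.8, 0.2],   # wolf
    [0.2, 0.8],   # tiger
    [0.1, 0.9],   # cat
])

# Each point starts as its own cluster; the closest pair is merged
# repeatedly, stopping here once two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)

for animal, label in zip(animals, labels):
    print(f"{animal}: cluster {label}")
```

With these toy values, dog and wolf end up in one cluster and tiger and cat in the other, matching the example above.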
Dendrograms help us map the clusters. A dendrogram is a tree diagram that shows the relationships among the clusters in the data; each level describes a sub-cluster and the objects that fall into it. As we said, in the agglomerative approach we start from the bottom.
Let's continue our example with the animals. When we merge dog and wolf, that small cluster sits at level 1; similarly, tiger and cat form another, dissimilar cluster at the same level. Proceeding up the hierarchy by similarity, these merge into a carnivores cluster at level 2, which takes in the level 1 clusters. Here we see a pattern: the closer two clusters are to each other, the earlier they are grouped together, while the farther ones remain as their own unique clusters for longer.

[Figure: the example data points plotted as a scatter plot]

[Figure: the resulting dendrogram]
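As a sketch of how such a dendrogram can be produced, SciPy's hierarchy module works well; this reuses the same made-up animal data from the earlier snippet:

```python
# Build and plot a dendrogram for the same toy animal data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

animals = ["dog", "wolf", "tiger", "cat"]
X = np.array([
    [0.9, 0.1],   # dog
    [0.8, 0.2],   # wolf
    [0.2, 0.8],   # tiger
    [0.1, 0.9],   # cat
])

# linkage() performs the bottom-up merging; "average" uses the mean
# pairwise distance between clusters as the merge criterion.
Z = linkage(X, method="average")

# The height of each join in the plot is the distance at which the
# two clusters were merged.
dendrogram(Z, labels=animals)
plt.ylabel("merge distance")
plt.show()
```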
Calculating distance
We can calculate the distance between two clusters in four ways (a small sketch follows this list):
- Minimum (single linkage) – the closest distance between any two points, one from each cluster
- Maximum (complete linkage) – the farthest distance between any two points, one from each cluster
- Average (average linkage) – the average of all pairwise distances between the two clusters
- Centroid – the Euclidean distance between the centroids of the two clusters
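Here is a minimal sketch that computes all four of these cluster distances between two small, hypothetical clusters; the point coordinates are assumptions chosen just for the arithmetic:

```python
# Compute single (min), complete (max), average, and centroid distances
# between two made-up clusters, using Euclidean point-to-point distances.
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 4.0], [4.0, 4.0]])

# All pairwise Euclidean distances between the two clusters.
d = cdist(cluster_a, cluster_b)

print("minimum (single linkage):  ", d.min())
print("maximum (complete linkage):", d.max())
print("average linkage:           ", d.mean())

# Centroid distance: Euclidean distance between the cluster means.
centroid_dist = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
print("centroid distance:         ", centroid_dist)
```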
Space and Time Complexity – HCA
Space Complexity: The algorithm has to store the full similarity matrix in RAM, so its space complexity is O(n²) for n data points.
Time Complexity: There are O(n) merge iterations, and each one scans and updates the O(n²) similarity matrix, so the naive implementation takes O(n³) overall.
Disadvantage of HCA
This algorithm doesn't work well with huge datasets, since both its space and time complexity grow quickly with the number of points.
Also read: Image Classification in Python
Thank you