Hierarchical Clustering Analysis

Hey guys, today in another data science post we will talk about hierarchical clustering. Let's have a simple definition of clustering first. Clustering is an unsupervised technique that groups similar data points, for instance points on a scatter plot, into classes without any predefined labels; classification works the other way around, learning from instances that already carry class labels.

Hierarchical Clustering Analysis (HCA)

Let us assume we have a dataset of animals, where each point represents a different animal. We start with one data point and look for the closest point to it. For example, dog and wolf come under one cluster, and tiger and cat come under another cluster, based on the properties in the dataset. The algorithm then builds a dendrogram, a hierarchy of clusters.

HCA is of two types:

  1. Agglomerative
  2. Divisive

The one we talk about here is the agglomerative technique, which is a bottom-up approach (a runnable sketch follows the steps below).

  1.  Each data point starts as its own cluster.
  2.  Find the two closest clusters and merge them.
  3.  Repeat step 2 until all points fall under one supercluster.
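
Here is a minimal sketch of those three steps using SciPy. The animal names and the 2-D feature values are made up purely for illustration; a real dataset would supply its own properties.

    # A minimal agglomerative clustering sketch with SciPy.
    # The 2-D feature values for each animal are hypothetical.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    animals = ["dog", "wolf", "cat", "tiger"]
    X = np.array([
        [3.0, 4.0],   # dog
        [3.2, 4.3],   # wolf
        [1.0, 1.0],   # cat
        [1.3, 1.2],   # tiger
    ])

    # linkage() performs the bottom-up merging from steps 1-3;
    # 'average' linkage uses the mean pairwise Euclidean distance.
    Z = linkage(X, method="average", metric="euclidean")

    # Cut the hierarchy into two flat clusters and print the assignment.
    labels = fcluster(Z, t=2, criterion="maxclust")
    for name, label in zip(animals, labels):
        print(f"{name} -> cluster {label}")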

Dendrograms help us map the clusters. A dendrogram is a hierarchy that shows the relationships among the clusters in the data, where each level describes a sub-cluster and the objects that fall into it. As we said, we build it from the bottom up.

Let's continue our example with animals. When we start, dog and wolf form one small cluster at level 1, and tiger and cat form another level-1 cluster that is dissimilar to the first. As we move up the hierarchy, these merge by similarity into a carnivores cluster at level 2, which takes in the points from the level-1 clusters. Here we see a pattern: the closer two clusters are to each other, the sooner they merge into a common class, while the farther ones stay as their own unique clusters until higher levels.
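
To see those levels, we can plot the dendrogram for the sketch above (this continues the hypothetical animals example and assumes matplotlib is installed):

    # Plot the hierarchy built above; the y-axis is the merge distance,
    # so dog/wolf and cat/tiger join low and everything merges higher up.
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram

    plt.figure(figsize=(6, 4))
    dendrogram(Z, labels=animals)
    plt.ylabel("merge distance")
    plt.show()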


[Figures: the data points and the resulting dendrogram]

Calculating distance

We can calculate the distance between two clusters in four ways (illustrated in the sketch after this list):

  • Minimum (single linkage) – the closest distance between any two points of the two clusters
  • Maximum (complete linkage) – the farthest distance between any two points of the two clusters
  • Average (average linkage) – the average of all pairwise distances between the two clusters
  • Euclidean distance between the cluster centroids (centroid linkage)
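
A quick numeric sketch with two made-up clusters shows how each definition gives a different distance for the same pair of clusters:

    # Compare the four cluster-distance definitions on toy data.
    import numpy as np
    from scipy.spatial.distance import cdist

    cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
    cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

    d = cdist(cluster_a, cluster_b)  # all pairwise Euclidean distances

    print("minimum (single linkage):  ", d.min())   # 3.0
    print("maximum (complete linkage):", d.max())   # 6.0
    print("average (average linkage): ", d.mean())  # 4.5
    centroid_diff = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
    print("centroid (Euclidean):      ", np.linalg.norm(centroid_diff))  # 4.5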

Space and Time Complexity – HCA

Space Complexity: the algorithm needs to hold the pairwise similarity matrix in RAM, so its space complexity is O(n²).

Time Complexity: there are O(n) merge steps, and each step scans and updates the O(n²) similarity matrix, so the naive implementation takes O(n³) time overall.
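
A back-of-the-envelope check makes this concrete (the value of n here is just an example):

    # Memory needed for a full float64 distance matrix at n = 100,000
    n = 100_000
    bytes_needed = n * n * 8           # 8 bytes per float64 entry
    print(bytes_needed / 1e9, "GB")    # ~80 GB, far beyond typical RAM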

Disadvantage of HCA

This algorithm doesn't work well with huge datasets, since it has high space and time complexity.

Also read: Image Classification in Python

Thank you
