Introduction to the K-Means Algorithm
In this tutorial, I am going to discuss a very famous clustering algorithm: the K-Means algorithm. Clustering algorithms group similar items into the same group, called a cluster. This post will be a basic introduction to the K-Means algorithm.
Overview
You might be wondering what clustering algorithms are. These algorithms put similar items into the same group. They fall under Unsupervised Learning because we do not know in advance what the output should be; instead, they help us discover patterns in the data.
Clustering can be of two kinds:
- Distance based, in which we compute distances between points and then form clusters.
- Concept based, in which clusters are formed from points that are similar in properties and behavior.
The goal of clustering is to find groupings in the available unlabeled data.
A very good application is customer segmentation. For instance, suppose a person owns a big mart and wants to segment his customers based on their shopping behavior. He could start by taking people in batches of 10-20, then iteratively apply an algorithm to form groups based on the features (behavior) he observes. For example, people who buy chocolates can be put in one group and those who buy chips in another.
The K-Means Algorithm
K-Means is one of the most popular clustering algorithms. It is distance based: the basic idea is to iteratively assign similar points to K groups based on distance computations. The distance metric used can be Euclidean distance, Manhattan distance, or any other.
Algorithm:
- From the available data set, randomly choose K points. These points serve as the initial cluster centers (centroids).
- Compute the distance of every remaining point from each of the K centroids.
- Assign each point to the centroid from which its computed distance is minimum.
- Compute the centroid of each newly formed cluster by averaging the x-coordinates and y-coordinates of all the points in the cluster. For example, for three points (X1, Y1), (X2, Y2) and (X3, Y3), the centroid is (X', Y') = ( (X1 + X2 + X3) / 3 , (Y1 + Y2 + Y3) / 3 ).
- Run the algorithm again: compute the distances of points from the new centroids and reassign points to clusters as before.
- Repeat until the cluster assignments no longer change. The whole procedure is sketched in the code below.
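Here is a minimal NumPy sketch of the steps above. The function name kmeans, the max_iters cap, and the toy data array are my own illustrative choices, not part of any particular library:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-Means sketch using Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: compute the distance of every point to every centroid
        # and assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 5: stop once the centroids (and hence the clusters) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two visually separated groups of 2-D points.
data = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                 [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
labels, centroids = kmeans(data, k=2)
print(labels)      # cluster index assigned to each point
print(centroids)   # final cluster centers
```

Note that because the initial centroids are chosen randomly, different seeds can lead to different final clusters.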
Advantages:
- The algorithm is computationally fast even when the number of variables is large, provided K is small.
- Compared with some other clustering algorithms, K-Means tends to form more compact clusters.
Disadvantages:
- Since there is no definitive method to choose K, results can differ for different values of K.
- Picking a good value of K is often time consuming; a common heuristic, the elbow method, is sketched below.
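The elbow method is one widely used heuristic (it is not part of the algorithm itself): run K-Means for several values of K, compute the inertia (the sum of squared distances of points to their assigned centroids) for each, and pick the K at which the curve bends and stops dropping sharply. A sketch, reusing the kmeans() function and data array from the earlier example:

```python
import numpy as np

def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Run K-Means for K = 1..5; the "elbow" where the inertia's
# drop flattens out suggests a reasonable choice of K.
for k in range(1, 6):
    labels, centroids = kmeans(data, k=k)
    print(k, inertia(data, labels, centroids))
```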
Applications:
- Customer Segmentation
- Insurance fraud detection
- Detecting crime-prone areas
- Data analysis
Distance Measures:
- Euclidean Distance = √( ∑ (Xi − Yi)² ), where i = 1, 2, …, n.
- Manhattan Distance = ∑ |Xi − Yi|, where i = 1, 2, …, n (the sum of the absolute differences).
- Cosine Similarity = ( X · Y ) / ( norm(X) * norm(Y) ), where X · Y is the dot product of vectors X and Y and norm(X) is the Euclidean norm (length) of X. This is often used for sparse or high-dimensional data such as text. All three measures are shown in the snippet below.
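A quick NumPy illustration of the three measures (the vectors x and y are arbitrary examples of my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 4.0])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((x - y) ** 2))   # equivalently np.linalg.norm(x - y)

# Manhattan distance: sum of absolute differences.
manhattan = np.sum(np.abs(x - y))

# Cosine similarity: dot product divided by the product of the vector norms.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, cosine)
```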
So, this was a basic introduction to the K-Means algorithm. In further tutorials, I will discuss its implementation in Python.
Also, give a read to Introduction to Natural Language Processing- NLP.
Feel free to post your doubts in the comments section.