LECTURE - 22


Unsupervised Algorithm: K-Means Clustering

Soumyadeep Khandual (1711134)


Introduction

Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labels. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Clusters are modelled using a measure of similarity defined on metrics such as Euclidean or probabilistic distance. One of the most widely used clustering algorithms is k-means clustering, which partitions data into K distinct clusters based on the distance of each point to the centroid of a cluster.

Algorithm

The k-means algorithm iteratively partitions the dataset into K distinct non-overlapping clusters, with each data point belonging to exactly one cluster. It tries to minimize the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster). The smaller this variation, the closer the data points are to the centroid, and hence the more similar the data points within a cluster are to each other.

Working of the K-means algorithm:

  1. Choose K random feature vectors from the data set as the initial centroids.
  2. Calculate the Euclidean distance of each feature vector to each centroid.
  3. Assign each data point to its closest centroid.
  4. For each cluster, calculate the arithmetic mean of all data points that belong to it and take that as the new centroid.
  5. Repeat steps 2-4 until the centroids no longer change, or change only slightly (see the sketch after Fig.1).
Fig.1 - Pseudocode for K-Means, taken from the CIML book [1]
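
The steps above map almost directly onto NumPy. Below is a minimal from-scratch sketch of the update loop; the function name and default parameters are our own illustrative choices, not from [1].

    import numpy as np

    def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick K random data points as the initial centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iters):
            # Step 2: Euclidean distance of every point to every centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            # Step 3: assign each point to its closest centroid
            labels = dists.argmin(axis=1)
            # Step 4: new centroid = mean of the points in each cluster
            # (empty clusters are not handled in this sketch)
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            # Step 5: stop when the centroids barely move
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return labels, centroids

On well-separated data, labels, centroids = kmeans(X, K=2) should behave like the sklearn implementation used later in this lecture.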

Convergence

K-means follows a greedy approach: at each step it locally minimizes the distance between each point and its assigned centroid. Since a finite input data set has only finitely many possible partitions into K groups, and each iteration never increases the objective, K-means is guaranteed to converge, although possibly only to a local minimum and possibly after many iterations. The objective function that K-means optimizes is the sum of squared distances from each data point to its assigned centroid,

$$ L = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \| x_i - \mu_k \|^2, $$

where \( w_{ik} = 1 \) if point \( x_i \) is assigned to cluster \( k \) and \( 0 \) otherwise. This is a minimization problem in two parts. We first minimize \( L \) with respect to \( w_{ik} \) while treating \( \mu_k \) as fixed (the assignment step), and then minimize \( L \) with respect to \( \mu_k \) while treating \( w_{ik} \) as fixed (the centroid update). This alternating minimization drives the clusters apart while keeping the points inside each cluster as close together as possible.
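
To see why the centroid update in step 4 is the minimizer, hold \( w_{ik} \) fixed and set the derivative of \( L \) with respect to \( \mu_k \) to zero:

$$ \frac{\partial L}{\partial \mu_k} = -2 \sum_{i=1}^{m} w_{ik} (x_i - \mu_k) = 0 \quad \Rightarrow \quad \mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x_i}{\sum_{i=1}^{m} w_{ik}}, $$

i.e. \( \mu_k \) is exactly the arithmetic mean of the points currently assigned to cluster \( k \).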

Hyperparameter

There is only one hyperparameter in K-Means: K, the number of clusters. The algorithm cannot determine this number itself; it only segregates the data into the given number of clusters. To find a good value of K we use a technique called the elbow method. As K increases, the sum of squared distances \(L\) tends to zero, reaching zero when K equals the number of data points. A good choice of K is the point beyond which increasing K no longer decreases \(L\) significantly. Plotting \(L\) vs. K gives an elbow-like curve, and the elbow point gives a good approximation of K. In the example below, K = 2 is a good choice.

Fig.2 - Elbow method
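
A curve like Fig.2 can be reproduced in a few lines of sklearn using the inertia_ attribute, which stores \(L\) for a fitted model; the data-generation settings below are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1, random_state=1)
    Ks = range(1, 10)
    # inertia_ is the sum of squared distances L for the fitted model
    L = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in Ks]
    plt.plot(Ks, L, 'o-')                 # elbow appears at K = 2 for this data
    plt.xlabel('K'); plt.ylabel('L (inertia)')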

Sample Code with example

We will use the sklearn package to generate the data and apply the K-Means algorithm in this example. A detailed implementation of K-Means from scratch can be found here.
STEP 1: Generate the data set.

    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    X, y_true = make_blobs(n_samples=300, centers=2,       # generate 300 data points
                           cluster_std=1, random_state=1)  # in two clusters
    plt.scatter(X[:, 0], X[:, 1], s=50)                    # plot the raw data
STEP 2: Apply K-Means.

    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=2)                  # K = 2 clusters
    kmeans.fit(X)                                  # run the algorithm
    y_kmeans = kmeans.predict(X)                   # cluster label of each point
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')  # plot clusters
Fig.3 - Raw data (left); clustering by K-Means (right)
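
Continuing from the snippet above, the fitted centroids are stored in the cluster_centers_ attribute and can be overlaid on the plot, for example:

    centers = kmeans.cluster_centers_              # shape (2, 2): one row per centroid
    plt.scatter(centers[:, 0], centers[:, 1],
                c='black', s=200, alpha=0.5)       # mark the centroids on the plot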

Applications

K-means clustering is easy to apply even to large data sets. It has been used successfully in market segmentation, computer vision, and astronomy. Search engines use it widely: when a search is performed, the results need to be grouped, and search engines very often use clustering to do this. Clustering can also be used to grade academic performance: based on their scores, students are categorized into grades such as A, B, or C.

Limitations

K-means assumes clusters are spherical (with radius equal to the distance between the centroid and the furthest data point) and does not work well when clusters have other shapes. Consider the example below: the k-means algorithm does not let data points that are far away from each other share the same cluster, even though they obviously belong to the same crescent-shaped group.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    import matplotlib.pyplot as plt

    X, y = make_moons(200, noise=.05, random_state=0)   # two interleaved half-moons
    plt.scatter(X[:, 0], X[:, 1], s=50)                 # plot the raw data
    labels = KMeans(2, random_state=0).fit_predict(X)   # cluster with K = 2
    plt.scatter(X[:, 0], X[:, 1], c=labels,
                s=50, cmap='viridis')                   # plot the (incorrect) clusters
Fig.4 - Raw data (left); clustering by K-Means (right)

References

[1] Hal Daumé III, A Course in Machine Learning.
[2] MathWorks, k-means clustering.
[3] Imad Dabbura, K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks.
[4] Tola Alade, Tutorial: How to Determine the Optimal Number of Clusters for K-means Clustering.
[5] Dr. S. Mishra, Lecture on K-means, NISER.