Soumyadeep Khandual (1711134)
Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labels. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. The clusters are modelled using a measure of similarity defined upon a metric such as Euclidean or probabilistic distance. One of the most widely used clustering algorithms is k-means clustering, which partitions data into K distinct clusters based on the distance of each point to the centroid of a cluster.
The k-means algorithm iteratively partitions the dataset into K distinct, non-overlapping clusters, with each data point belonging to exactly one cluster. It minimizes the sum of squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster). The smaller this within-cluster variation, the closer the data points are to the centroid, and hence the more similar the data points in a cluster are to each other.
Working of the k-means algorithm:
The approach k-means follows is a greedy one: the algorithm locally minimizes the distance between each point and its assigned centroid. Since a finite data set admits only finitely many partitions into K clusters, and each iteration never increases the objective, k-means is guaranteed to converge, although possibly only to a local optimum. The objective function that k-means optimizes is the sum of squared distances from each data point to its assigned centroid, $$ L = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \|x_i - \mu_k\|^2, $$ where \( w_{ik} = 1 \) if point \( x_i \) is assigned to cluster \( k \) and \( w_{ik} = 0 \) otherwise. This is a minimization problem in two parts. We first minimize \( L \) w.r.t. \( w_{ik} \) with \( \mu_k \) held fixed, assigning each point to its nearest centroid. We then minimize \( L \) w.r.t. \( \mu_k \) with \( w_{ik} \) held fixed, moving each centroid to the mean of its assigned points. Alternating these two minimizations drives the points inside each cluster as close to their centroid, and hence to each other, as possible.
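To make the two alternating minimizations concrete, here is a minimal from-scratch sketch in NumPy. The function names (assign_clusters, update_centroids, kmeans_sketch) and the random initialization are illustrative choices, not a fixed specification.

import numpy as np

def assign_clusters(X, centroids):
    # Minimize L w.r.t. w_ik with mu_k fixed:
    # assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def update_centroids(X, labels, K):
    # Minimize L w.r.t. mu_k with w_ik fixed:
    # move each centroid to the mean of its assigned points.
    # (This sketch assumes every cluster keeps at least one point.)
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

def kmeans_sketch(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        labels = assign_clusters(X, centroids)
        new_centroids = update_centroids(X, labels, K)
        if np.allclose(new_centroids, centroids):  # centroids stopped moving: converged
            break
        centroids = new_centroids
    return labels, centroids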
There is only one hyperparameter in k-means, namely K. The number of clusters cannot be found by the algorithm itself; the algorithm only segregates the data into the given number of clusters. To find the optimum value of K we use a technique called the elbow method.
As K increases, the sum of squared distances \(L\) tends to zero, reaching zero when K equals the number of data points. A good choice of K is one beyond which further increasing K does not significantly decrease \(L\). Plotting \(L\) vs. K gives an elbow-like curve, and the elbow point gives a good approximation of K. In the example given below, K = 2 is a good approximation.
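As a sketch of the elbow method (using the same illustrative blob data as the example in the next section), one can use scikit-learn, which stores the final value of \(L\) in the fitted estimator's inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1, random_state=1)

# Fit k-means for a range of K and record L (exposed as inertia_).
Ks = range(1, 10)
L_values = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in Ks]

plt.plot(Ks, L_values, 'o-')       # elbow-like curve
plt.xlabel('K')
plt.ylabel('L (sum of squared distances)')
plt.show()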
We will be using the scikit-learn package to generate clusters and implement the k-means algorithm for the example. A detailed from-scratch implementation of k-means can be found in the references below.
STEP 1: Generate the data set.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=2,       # generate 300 data points
                       cluster_std=1, random_state=1)  # in two clusters
plt.scatter(X[:, 0], X[:, 1], s=50)  # plot the raw data
STEP 2: Apply k-means.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)  # K = 2 clusters
kmeans.fit(X)                  # run the algorithm on X
y_kmeans = kmeans.predict(X)   # cluster label for each point
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')  # plot, coloured by cluster
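The fitted estimator also exposes the learned centroids (cluster_centers_) and the final value of \(L\) (inertia_); overlaying the centroids on the scatter plot, as sketched below, makes the result easier to read:

centers = kmeans.cluster_centers_  # learned centroids mu_k
print(kmeans.inertia_)             # final value of L
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)  # mark the centroids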
K-means clustering is easy to apply even to large data sets, and it has been used successfully in market segmentation, computer vision, and astronomy. It is also widely used by search engines: when a search is performed, the results often need to be grouped, and search engines frequently use clustering to do this. Clustering can likewise be used for grading academic performance, where students are categorized into grades like A, B, or C based on their scores.
K-means assumes clusters are spherical (with radius equal to the distance between the centroid and the furthest data point in the cluster) and does not work well when clusters have different shapes. Consider the example below: the k-means algorithm does not let data points that are far from each other share the same cluster, even when they obviously belong to the same cluster.
from sklearn.datasets import make_moons

X, y = make_moons(200, noise=.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50)  # plot the two interleaving half-moons

labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')  # k-means splits the moons incorrectly
[1] Hal Daumé III, A Course in Machine Learning.
[2] MathWorks, k-means documentation.
[3] Imad Dabbura, K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks.
[4] Tola Alade, Tutorial: How to determine the optimal number of clusters for k-means clustering.
[5] Dr. S. Mishra, Lecture on k-means, NISER.