1 Introduction
2 Parameters
3 Algorithm
4 Working Example
5 Advantages and Disadvantages
6 References
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a handy little unsupervised learning algorithm that clusters points based on the density of their neighborhoods. At its base, the idea behind all the clustering algorithms we see is to find similarities between the points we want to cluster and bring similar points together. DBSCAN is no different: it looks at nearby points and reels them in if they fall within a certain specified radius. Two obvious parameters emerge here, which I will mention next.
Epsilon
(\(\epsilon\)): defines the neighborhood around a point. Any point within \(\epsilon\) distance of it is marked as a neighbor.
minPoints
: the minimum number of points that must fall within the \(\epsilon\) radius for the central point to be considered a core point, i.e. for the region to be considered dense.
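To make the two parameters concrete, here is a minimal sketch (plain NumPy, with made-up points and a made-up \(\epsilon\), not anything from the original post) of finding the \(\epsilon\)-neighborhood of a single point:

```python
import numpy as np

def epsilon_neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (itself included)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    return np.where(dists <= eps)[0]

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(epsilon_neighbors(points, 0, eps=0.5))  # -> [0 1 2]
```

With minPoints = 3, point 0 would qualify as a core point here, since its \(\epsilon\)-neighborhood contains three points; the far-away point at (5, 5) is not a neighbor of anything.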
Coming to the actual algorithm: DBSCAN works by first picking an arbitrary point in the dataset. That point can then be classified into one of three classes. It will be a:

Core point
: if it has at least minPoints number of points in its \(\epsilon\) neighborhood (itself included).
Border point
: if it has fewer than minPoints neighbors itself, but lies in the \(\epsilon\) neighborhood of a core point.
Noise point
: if it is neither a core point nor a border point.

Starting from a core point, DBSCAN grows a cluster by repeatedly absorbing every point in the \(\epsilon\) neighborhoods of the core points found so far; border points attach to the cluster of a nearby core point, while noise points are left unclustered.
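The three classes can be sketched in code. This is a minimal NumPy illustration with made-up points and thresholds (the helper name `classify_points` is mine, not part of any library):

```python
import numpy as np

def classify_points(points, eps, min_points):
    """Label each point as 'core', 'border', or 'noise' (hypothetical helper).

    Core: at least min_points points (itself included) within eps.
    Border: not core, but within eps of some core point.
    Noise: neither of the above.
    """
    # Pairwise Euclidean distance matrix via broadcasting
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    is_core = (dists <= eps).sum(axis=1) >= min_points
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & (dists[i] <= eps)):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.6, 0.0], [5.0, 5.0]])
print(classify_points(pts, eps=0.5, min_points=3))
# -> ['core', 'core', 'core', 'border', 'noise']
```

The dense trio near the origin are core points, the point at (0.6, 0) hangs off one of them as a border point, and the isolated point at (5, 5) is noise.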
Using sklearn's very simple API, we can construct an example of DBSCAN clustering on the two-moons dataset. I chose this specific example because many other algorithms fail at it: due to the overlapping parts of the moons, non-density-based algorithms cannot seem to find a proper distinction between the two clusters, which we can see from the example plots after this snippet.
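A sketch of the kind of snippet described (the parameter values here, such as `eps=0.3` and `min_samples=5`, are my own choices for this noise level, not values from the original post):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

# Two interleaving half-moons with a little Gaussian noise
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# KMeans must be told the number of clusters and partitions by
# distance to centroids, which cuts straight across the moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs only a radius and a density threshold; points that
# belong to no dense region are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("KMeans labels:", np.unique(kmeans_labels))
print("DBSCAN clusters (excluding noise):",
      len(set(dbscan_labels.tolist()) - {-1}))
```

Scattering `X` colored by each label array (e.g. with matplotlib) reproduces the side-by-side comparison described next.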
The plot on the right is the one we obtain from DBSCAN clustering; the plot on the left is from KMeans. We can clearly see the point of failure here, and why one would prefer one over the other.
The advantages that seem to stand out the most to me are:

No cluster count required
: unlike KMeans, DBSCAN does not need a numClasses-style parameter telling it how many clusters to find; the clusters emerge from the density structure of the data.
Noise handling
: outliers are explicitly labeled as noise instead of being forced into the nearest cluster.
Arbitrary cluster shapes
: as the moons example shows, clusters are not restricted to convex, blob-like regions.