The DBSCAN Clustering Algorithm

Spandan Anupam


Table of Contents

1 Introduction
2 Parameters
3 Algorithm
4 Working Example
5 Advantages and Disadvantages

Introduction

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a handy little unsupervised learning algorithm that clusters points based on how densely they are packed. At the very base, the idea behind all the clustering algorithms we see is to find similarities between the points we want to cluster and bring similar points together. DBSCAN is no different: it looks at nearby points and reels them in if they lie within a certain specified radius. Two obvious parameters emerge here, which I will mention next.

Parameters

  1. Epsilon (\(\epsilon\)): Defines the neighborhood around a point. Any point within \(\epsilon\) distance of it is marked as a neighbor.
  2. minPoints: The minimum number of points that must lie within the \(\epsilon\) radius for the central point to be considered a core point, or equivalently, for the region to be considered dense.
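To make the two parameters concrete, here is a minimal sketch of the core-point test they define; the function name `is_core_point` is my own, and I follow the common convention (also used by sklearn) of counting the point itself among its neighbors:

```python
import numpy as np

def is_core_point(points, idx, eps, min_points):
    """points[idx] is a core point if at least min_points points
    (itself included) lie within distance eps of it."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return (dists <= eps).sum() >= min_points

# A tight cluster of three points plus one distant outlier.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(pts, 0, eps=0.5, min_points=3))  # dense neighborhood
print(is_core_point(pts, 3, eps=0.5, min_points=3))  # isolated point
```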

Algorithm

Coming to the actual algorithm: DBSCAN works by first picking a random point in the dataset. That data point can then be classified into one of three classes. It will be a:

  1. Core point: at least minPoints points (itself included) lie within its \(\epsilon\) neighborhood.
  2. Border point: it lies within the \(\epsilon\) neighborhood of a core point, but does not itself have minPoints neighbors.
  3. Noise point: it is neither a core point nor a border point.

So, the algorithm scans across the core points to find more core points, and recursively carries on this exercise until all the points have been exhausted. It is important to note here that noise points are allowed, unlike in k-means clustering, which will desperately fit every point into some class or the other, even if it is an outlier. DBSCAN, however, does not take the number of clusters as an input in the first place. It depends only on the radius you choose to look at and on your definition of dense, which seems to be a natural way of thinking.
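The scan-and-expand procedure described above can be sketched as a short, self-contained implementation; the function names `dbscan` and `region_query` and the choice of 1-based cluster ids are my own, not part of any standard API:

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (itself included)."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return np.flatnonzero(dists <= eps)

def dbscan(points, eps, min_points):
    """Return an array of labels: cluster ids starting at 1, or -1 for noise."""
    labels = np.full(len(points), UNVISITED)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE        # may be relabelled later as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                 # expand the cluster outwards from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neigh = region_query(points, j, eps)
            if len(j_neigh) >= min_points:  # j is itself a core point
                seeds.extend(j_neigh)       # so keep expanding through it
    return labels

# Two tight blobs and one isolated outlier.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
                [10, 10]], dtype=float)
print(dbscan(pts, eps=0.3, min_points=3))  # two clusters, outlier marked -1
```

Note how a point first marked as noise can later be absorbed as a border point, but is never expanded through: only core points grow the cluster.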

Working Example

Using sklearn's very simple API, we can construct an example of DBSCAN clustering. I chose this specific example, the two-moons dataset, because many other algorithms fail at it due to the overlapping parts of the moons. Non-density-based algorithms cannot seem to find a proper distinction between the two clusters, as we can see from the example plots after this snippet.
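A sketch of what such a snippet could look like, comparing scikit-learn's `DBSCAN` and `KMeans` on the two-moons dataset; the values `eps=0.3` and `min_samples=5` are illustrative choices for this noise level, not prescribed ones:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means cannot separate,
# because its clusters are always convex regions around a centroid.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("DBSCAN found clusters:", sorted(set(db.labels_)))
print("KMeans found clusters:", sorted(set(km.labels_)))
```

DBSCAN recovers each moon as one density-connected cluster, while KMeans slices the plane in two and cuts both moons apart.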

The plot on the right is the one we obtain from DBSCAN clustering; the plot on the left is from KMeans. We can clearly see the point of failure here, and why one would prefer one algorithm over the other.

The advantages that seem to stand out the most to me are:

  1. It does not require the number of clusters to be specified beforehand.
  2. It can find arbitrarily shaped clusters, not just convex blobs.
  3. It has a built-in notion of noise, making it robust to outliers.

The disadvantages that stand out are:

  1. The results are sensitive to the choice of \(\epsilon\) and minPoints.
  2. It struggles when clusters have widely varying densities, since a single \(\epsilon\) cannot fit all of them.
  3. Distance-based neighborhoods become less meaningful in high-dimensional spaces, degrading its performance there.
