The DBSCAN Clustering Algorithm

Spandan Anupam


Table of Contents

1 Introduction
2 Parameters
3 Algorithm
4 Working Example
5 Advantages and Disadvantages

Introduction

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a handy little unsupervised learning algorithm that clusters points based on how densely they are packed. At the very base, the idea behind all the clustering algorithms we see is to find similarities between the points we want to cluster and bring similar points together. DBSCAN is no different: it looks at nearby points and reels them in if they lie within a certain specified radius. Two obvious parameters emerge here, which I will mention next.

Parameters

  1. Epsilon (\(\epsilon\)): Defines the neighborhood around a point. Any point within \(\epsilon\) distance of it is marked as a neighbor.
  2. minPoints: The minimum number of points that must lie within the \(\epsilon\) radius for the central point to be considered a core point, or equivalently, for the region to be considered dense.
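To make the two parameters concrete, here is a minimal sketch of the core-point test they define; the function name `is_core_point` is my own, and I follow the common convention (also used by sklearn) of counting the point itself among its neighbors:

```python
import numpy as np

def is_core_point(points, idx, eps, min_points):
    """points[idx] is a core point if at least min_points points
    (itself included) lie within distance eps of it."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return (dists <= eps).sum() >= min_points

# A tight cluster of three points plus one distant outlier.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core_point(pts, 0, eps=0.5, min_points=3))  # dense neighborhood
print(is_core_point(pts, 3, eps=0.5, min_points=3))  # isolated point
```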

Algorithm

Coming to the actual algorithm: DBSCAN works by first picking a random point in the dataset. That data point can then be classified into one of three classes. It will be a:

  1. Core point: at least minPoints points (itself included) lie within its \(\epsilon\) neighborhood.
  2. Border point: it lies within the \(\epsilon\) neighborhood of a core point, but does not itself have minPoints neighbors.
  3. Noise point: it is neither a core point nor a border point.

So, the algorithm scans across the core points to find more core points, and recursively carries on this exercise until all the points have been exhausted. It is important to note here that noise points are allowed, unlike in k-means clustering, which will desperately fit every point into some class or the other, even if it is an outlier. DBSCAN, however, does not take the number of clusters as an input in the first place. It depends only on the radius you choose to look at and on your definition of dense, which seems to be a natural way of thinking.
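The scan-and-expand procedure described above can be sketched as a short, self-contained implementation; the function names `dbscan` and `region_query` and the choice of 1-based cluster ids are my own, not part of any standard API:

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (itself included)."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return np.flatnonzero(dists <= eps)

def dbscan(points, eps, min_points):
    """Return an array of labels: cluster ids starting at 1, or -1 for noise."""
    labels = np.full(len(points), UNVISITED)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE        # may be relabelled later as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                 # expand the cluster outwards from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neigh = region_query(points, j, eps)
            if len(j_neigh) >= min_points:  # j is itself a core point
                seeds.extend(j_neigh)       # so keep expanding through it
    return labels

# Two tight blobs and one isolated outlier.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
                [10, 10]], dtype=float)
print(dbscan(pts, eps=0.3, min_points=3))  # two clusters, outlier marked -1
```

Note how a point first marked as noise can later be absorbed as a border point, but is never expanded through: only core points grow the cluster.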

Working Example

Using sklearn's very simple API, we can construct an example of DBSCAN clustering. I chose this specific example, the two-moons dataset, because many other algorithms fail at it due to the overlapping parts of the moons. Non-density-based algorithms cannot seem to find a proper distinction between the two clusters, as we can see from the example plots after this snippet.
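A sketch of what such a snippet could look like, comparing scikit-learn's `DBSCAN` and `KMeans` on the two-moons dataset; the values `eps=0.3` and `min_samples=5` are illustrative choices for this noise level, not prescribed ones:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means cannot separate,
# because its clusters are always convex regions around a centroid.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("DBSCAN found clusters:", sorted(set(db.labels_)))
print("KMeans found clusters:", sorted(set(km.labels_)))
```

DBSCAN recovers each moon as one density-connected cluster, while KMeans slices the plane in two and cuts both moons apart.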

The plot on the right is the one we obtain from DBSCAN clustering; the plot on the left is from KMeans. We can clearly see the point of failure here, and why one would prefer one algorithm over the other.

The advantages that seem to stand out the most to me are:

  1. It does not require the number of clusters to be specified beforehand.
  2. It can find arbitrarily shaped clusters, not just convex blobs.
  3. It has a built-in notion of noise, making it robust to outliers.

The disadvantages that stand out are:

  1. The results are sensitive to the choice of \(\epsilon\) and minPoints.
  2. It struggles when clusters have widely varying densities, since a single \(\epsilon\) cannot fit all of them.
  3. Distance-based neighborhoods become less meaningful in high-dimensional spaces, degrading its performance there.
