DBSCAN

Density-Based Spatial Clustering of Applications with Noise

This is an unsupervised clustering algorithm.

DBSCAN will identify points that have lots of neighbors in some space. We use this number-of-neighbors approach to separate clusters and find and/or remove outliers. So outliers just have few neighbors, or, their neighbors are too far away.

Why is it awesome? Wikipedia:

DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

In 2014, the algorithm was awarded the test of time award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, KDD.

Boom.

Algorithm

We have a bunch of data points - and we want to decide if they are either core or non-core points. We do so by establishing two parameters: minimum number of neighbors, and some distance, which we'll call (epsilon). For now, we don't have to decide on what exactly that distance is, just that there is some distance metric. Scitkit-learn lets you use quite a few.

In this image, we can see all the red points in cluster A being joined because they are within some set value (the distance of the rings). Points B & C can be thought of as "edges" since they don't allow you reach any other points, whereas point N is an outlier.

Image Attribution: Creative Commons from Wikipedia

We also have to decide on a minimum number of samples that establish a cluster as valid. The default in scikit is 5.

Choosing Epsilon

There are lots of people who have tackled this question - here is one example from spark. However, it almost always boils down to what you are trying to do. People have used DBSCAN for anomaly detection1, whereas other times, it's used in a more conventional clustering scenario.

There are also modifications of variants of DBSCAN that address this issue:

Implementations

Scikit-learn has some options for us:

  • Nearest Neighbors
  • KDTree
  • Ball Tree
  • Brute Force

If you'd like to get into the nitty gritty details of which one to use when, please see this page.

results matching ""

    No results matching ""