BIRCH Clustering
Balanced Iterative Reducing and Clustering using Hierarchies
This is an unsupervised clustering algorithm.
Birch will cluster a multidimensional dataset into k clusters. Why do we want to use it? It's optimized for large datasets, deals with outliers, and minimizes disk IO. It's very similar to a B+ tree, but instead of each node holding a key, it holds a summary statistic we call a clustering feature, or CF.
Why is it awesome? Wikipedia:
Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case wherein a data-set was too large to fit in main memory. As a result, there was a lot of overhead maintaining high clustering quality while minimizing the cost of additional IO (input/output) operations. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each 'clustering decision' and do not perform heuristic weighting based on the distance between these data points.
So how does it work? We have a few steps to go through. It is basically doing two phases of clustering. First, it does a simple low-level clustering (sometimes called micro-clustering). After that, it does higher-level macro clustering.
Background
First, we have our dataset $X$; that is, the first point is $X_1$, the second $X_2$, and so on up to $X_N$. Birch clustering uses a clustering feature tree (also called a characteristic feature tree), which we'll just call a CF tree. A CF has three components:
- $N$: the number of data points
- $LS$: the linear sum of the $N$ points, $\sum_{i=1}^{N} X_i$
- $SS$: the squared sum of the $N$ points, $\sum_{i=1}^{N} X_i^2$
So we have,
- $CF = (N, LS, SS)$
Here is a small example of calculating a single CF:
x | y |
---|---|
3 | 8 |
5 | 1 |
1 | 10 |
2 | 6 |
8 | 3 |
In this case,
- N is five because we have five points
- LS is the tuple (19,28), that is, the sum of x values and the sum of y values.
- SS is the tuple (103,210), that is, the sum of squared x and squared y values.
- $CF = (5, (19, 28), (103, 210))$
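To make this concrete, here's a quick NumPy sketch (not from the original post) that reproduces the numbers above and also shows the property that makes CFs so useful: they're additive, so merging two subclusters is just entrywise addition of their CFs.

```python
import numpy as np

# The five example points from the table above
X = np.array([[3, 8],
              [5, 1],
              [1, 10],
              [2, 6],
              [8, 3]])

N = len(X)                  # 5
LS = X.sum(axis=0)          # [19 28]
SS = (X ** 2).sum(axis=0)   # [103 210]

# CFs are additive: the CF of a merged cluster is the entrywise sum
CF_a = (3, X[:3].sum(axis=0), (X[:3] ** 2).sum(axis=0))
CF_b = (2, X[3:].sum(axis=0), (X[3:] ** 2).sum(axis=0))
CF_merged = (CF_a[0] + CF_b[0], CF_a[1] + CF_b[1], CF_a[2] + CF_b[2])

print(N, LS, SS)   # 5 [19 28] [103 210]
print(CF_merged)   # (5, array([19, 28]), array([103, 210]))
```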
There are a few other things BIRCH will calculate:
- centroid, $X_0$: the middle of the cluster; it can be computed straight from the CF as $LS/N$, so there's no need to recalculate it from the raw points

$$X_0 = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{LS}{N}$$

- distance, $D$: the average pairwise distance within the cluster

$$D = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(X_i - X_j)^2}{N(N-1)}}$$

- radius, $R$: the average distance from a member to the centroid

$$R = \sqrt{\frac{\sum_{i=1}^{N}(X_i - X_0)^2}{N}}$$

We might also want to define the distance between two centroids, for example the Euclidean distance:

$$D_e = \sqrt{(X_{0_1} - X_{0_2})^2}$$
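The point of storing $LS$ and $SS$ is that these statistics can be computed from the CF alone, without rescanning the raw points. Here's a small sketch (reusing the toy points from above) confirming that the CF-only radius matches the direct definition:

```python
import numpy as np

X = np.array([[3, 8], [5, 1], [1, 10], [2, 6], [8, 3]])
N, LS, SS = len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

# Centroid from the CF alone: LS / N
centroid = LS / N

# Radius from the CF alone: R^2 = sum(SS)/N - ||centroid||^2
R_from_cf = np.sqrt(SS.sum() / N - (centroid ** 2).sum())

# Radius from the definition: root of the mean squared distance to the centroid
R_direct = np.sqrt(((X - centroid) ** 2).sum(axis=1).mean())

print(centroid)               # [3.8 5.6]
print(R_from_cf, R_direct)    # both ~4.099
```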
So now we've got the basic terms down.
Phase 1 : Initialize CF Tree
Scan all of the data and build a CF tree using a given amount of memory, recycling space on disk as needed. The tree tries to be as granular as possible given that limit. This phase ultimately creates a summary of all of the data via the CF values, and sparse data points are removed as outliers along the way. The original paper points out three reasons why this is important for the later computations:
- fast because no I/O is needed and the problem of clustering the original data is reduced to a smaller problem of just clustering the subclusters;
- accurate because outliers are eliminated and the remaining data is grouped as finely as possible (given the memory constraint);
- less order sensitive, since the leaf entries of the initial tree form an input order with better data locality than the arbitrary ordering of the original data input.
The first option we can actually control in scikit-learn is the T value, or threshold, which is set to 0.5 by default.
The T value is important because it determines whether it is valid to merge a new sample into an existing CF, and thus how sensitive (and how large) the tree-building process is.
In the scikit-learn implementation, the R value we mentioned earlier is compared to the T value. scikit-learn says:
The radius of the subcluster obtained by merging a new sample and the closest subcluster should be lesser than the threshold. Otherwise a new subcluster is started.
Secondly, we have the branching factor, B, which is set to 50 by default. scikit-learn says:
Maximum number of CF subclusters in each node. If a new samples enters such that the number of subclusters exceed the branching_factor then the node has to be split. The corresponding parent also has to be split and if the number of subclusters in the parent is greater than the branching factor, then it has to be split recursively.
This phase does the vast majority of the work.
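As a quick illustration of those two knobs, here is a minimal scikit-learn sketch (the toy dataset and the value of `n_clusters` are just placeholders; `threshold=0.5` and `branching_factor=50` are the defaults mentioned above):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy stand-in for a large dataset
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

# threshold (T) and branching_factor (B) control how the CF tree is built
brc = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
brc.fit(X)

print(len(brc.subcluster_centers_))  # number of leaf subclusters in the CF tree
print(brc.predict(X[:5]))            # cluster labels for new samples
```

Because the tree is built incrementally, you can also feed the data in chunks with `partial_fit` if the whole dataset doesn't fit comfortably in memory.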
Phase 2 : Optional Condense
This is mostly a repeat of phase 1, and it helps bridge the gap between phase 1 and phase 3: you can optionally scan the leaf entries again to rebuild a smaller CF tree, remove more outliers, and group crowded subclusters into larger ones.
Phase 3 : (Semi-)Global Clustering
In phase 3, we cluster all the leaf entries again, treating each subcluster as a single data point. There are different ways you could do this; in the paper, as well as in scikit-learn, agglomerative hierarchical clustering is used.
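In scikit-learn this corresponds to the `n_clusters` parameter: `None` stops after the CF tree (you get the raw subclusters), an integer runs agglomerative clustering over the subcluster centroids, and you can also pass an estimator instance yourself. A small sketch:

```python
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)

# n_clusters=None: stop after the CF tree; labels are the leaf subclusters
micro = Birch(threshold=0.5, n_clusters=None).fit(X)

# Passing an int (or an estimator) runs the global clustering step
# over the subcluster centroids
macro = Birch(threshold=0.5,
              n_clusters=AgglomerativeClustering(n_clusters=5)).fit(X)

print(len(set(micro.labels_)))  # many micro-clusters
print(len(set(macro.labels_)))  # 5 macro-clusters
```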
Phase 4 : Optional Redistribution
Phase 4 uses the centroids of the clusters produced by phase 3 as "seeds". It then redistributes the data points to their closest seed, producing a refined set of clusters.
Phase 4 can also discard more outliers: if a point is too far from every seed, it is treated as an outlier.
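scikit-learn doesn't expose phase 4 directly, but the idea is just a nearest-seed assignment with an optional distance cutoff. A rough sketch (the `redistribute` helper and the `outlier_distance` cutoff are illustrative names of my own, not part of any library):

```python
import numpy as np

def redistribute(X, seeds, outlier_distance=None):
    """Assign each point to its closest seed (a phase 3 centroid).

    Points farther than outlier_distance from every seed get label -1
    and are treated as outliers.
    """
    # Distance from every point to every seed
    dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if outlier_distance is not None:
        labels[dists.min(axis=1) > outlier_distance] = -1
    return labels

# Tiny demo: two seeds, one far-away point flagged as an outlier
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [50.0, 50.0]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])
print(redistribute(X, seeds, outlier_distance=3.0))  # [ 0  0  1  1 -1]
```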
Want to read more?
- You can read the original paper here: Zhang, Ramakrishnan & Livny 1996
- Here are some nice slides on the algo by Tian Zhang
- A YouTube video explanation