K-Means

This is a unsupervised clustering algorithm.

K-Means will group our data into clusters so that it minimizes the sum of squared distances from the centroid of the cluster.

Why is it awesome ? Wikipedia:

k-means clustering in particular when using heuristics such as Lloyd's algorithm is rather easy to implement and apply even on large data sets. As such, it has been successfully used in various topics, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It often is used as a preprocessing step for other algorithms, for example to find a starting configuration.

The article also mentions applications in:

vector quantization
cluster analysis
feature learning

Algorithm

If $(x_1,x_2,...x_n)$ are data points, and we want to group things into $k$ sets $S = (s_1, s_2,...s_k)$ , then minimizing the within cluster sum of squares (WCSS) amounts to:

$\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2$

The most popular / common algorithm is called Lloyd's Algorithm, named after Steward P. Lloyd. It requires some type of initialization. The two main initializations, Forgy and Random Parition, are both variations of choosing a random point to be the inital mean for each cluster.

Initalization - pick random points !

Lloyd's algorithm consists of three steps:

Assignment (or Expectation) Step
- Assign each point to the cluster whose mean gives the smallest WCSS

Update (or Maximization) Step
- Calculate new means

Iterate
- Repeat until "good enough". That is, centers don't change very much or, more commonly, a certain number of iterations has been reached. Scikit-learn will let you set the number of iterations with max_iter.

KMeans

K-Means

Algorithm

Interactive Demonstration

results matching ""

No results matching ""