K-Means

This is a unsupervised clustering algorithm.

K-Means will group our data into clusters so that it minimizes the sum of squared distances from the centroid of the cluster.

Why is it awesome ? Wikipedia:

k-means clustering in particular when using heuristics such as Lloyd's algorithm is rather easy to implement and apply even on large data sets. As such, it has been successfully used in various topics, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It often is used as a preprocessing step for other algorithms, for example to find a starting configuration.

The article also mentions applications in:

  • vector quantization
  • cluster analysis
  • feature learning

Algorithm

If are data points, and we want to group things into sets , then minimizing the within cluster sum of squares (WCSS) amounts to:

The most popular / common algorithm is called Lloyd's Algorithm, named after Steward P. Lloyd. It requires some type of initialization. The two main initializations, Forgy and Random Parition, are both variations of choosing a random point to be the inital mean for each cluster.

Initalization - pick random points !

Lloyd's algorithm consists of three steps:

  • Assignment (or Expectation) Step
    • Assign each point to the cluster whose mean gives the smallest WCSS

  • Update (or Maximization) Step
    • Calculate new means

  • Iterate
    • Repeat until "good enough". That is, centers don't change very much or, more commonly, a certain number of iterations has been reached. Scikit-learn will let you set the number of iterations with max_iter.

All images are Create Commons from here.

Interactive Demonstration

Naftali Harris has an excellent demo here.

results matching ""

    No results matching ""