Skip to content

Latest commit

 

History

History
79 lines (44 loc) · 2.27 KB

File metadata and controls

79 lines (44 loc) · 2.27 KB

Clustering

K-Means

K-Means Algorithm

  1. Add K centroids to the data at random positions.

Add centroids

  1. Associate each data point to the closest centroid (aka association step)

Associate step

  1. Move the centroids to the mean distance between all associated points

Move centroids

  1. Repeat step 2 and 3 n times, or until some other stop-condition has been met.

K-Means is not deterministic

The initial position of the centroids will influence the final outcome of the algorithm. See the example below:

uniform 1

uniform 2

To solve this problem, we run the algorithm multiple times and average the results.

K-Means and sklearn

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, 
                             tol=0.0001, precompute_distances='auto', verbose=0, 
                             random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
  • n_clusters: number of centroids to initialize. Also defines the number of clusters to be found. This should be set using domain knowledge of the problem.
  • max_iter: number of iterations (associate points, move centroids, repeat) to be run.
  • n_init: number of times the algorithm will run before outputing the results.

K-means references

Single Linkage Clustering

Single Linkage Clustering Algorithm

Single Linkage Clustering

Soft Clustering

soft clustering

  • Can assign the same point to multiple clusters
  • Probabilistic approach

Expectation Maximization

Expectation Maximization formula

Expectation Maximization Properties

Expectation Maximization Properties

Clustering Properties

Clustering Properties

Impossibility Theorem

Impossibility Theorem

Summary

Summary