Hacker News new | ask | show | jobs
by piemonkey 3482 days ago
> K-means is based on some quite wild assumptions - your data follows a specific case of the Gaussian distribution. Plus side is that the algorithm is relatively easy to understand and implement so it is a good starting point into clustering.

K-means is a non-parametric clustering algorithm, which assumes no underlying properties about your data. It is hard to evaluate the quality of a clustering algorithm without some clearly defined objective. In particular, k-mean's cousin in the classification community, k-nearest neighbors, actually is provably a a Bayes optimal decision strategy as k and the number of samples increases, regardless of the underlying distribution of your data (i.e., it is strongly consistent, see [0]).

> K-means can be levelled up with the EM algorithm

In fact, the k-means algorithm is actually a special case of fitting a mixture of k Gaussian distributions where all the Gaussians are isotropic (have identical covariance matrices with uniform elements on the diagonal, i.e. a scaled identity matrix).

[0] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm