Hacker News
by CuriouslyC 1554 days ago
A bigger problem with k-means than figuring out the number of clusters is that the model assumes the clusters are spherical multivariate Gaussians. That's a bad model for most non-toy data.
4 comments

That is not necessarily the case.

For example, word2vec runs k-means clustering with a cosine similarity measure [1]. It works very, very well. The caveat is that not many of the optimized variants of k-means will work with that "distance".

[1] https://github.com/tmikolov/word2vec/blob/master/word2vec.c#...

My feeling is that if for a given problem "cluster center" is a meaningful concept then k-means is the right tool. The concept of "distance" can be adjusted as well. I think any metric will do. But if you want clusters without a "center of mass" then you are faced with a whole other problem and need other tools.
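
To make the cosine-similarity variant concrete, here is a minimal sketch of what is usually called spherical k-means. It is not taken from word2vec.c; the function name `spherical_kmeans`, its parameters, and the toy data are all made up for illustration, and it assumes NumPy is available and that no row of the input is all zeros.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Sketch of k-means that assigns points by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Unit-normalise rows so a dot product equals cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Start from k distinct data points (a simple, hypothetical init choice).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to the center it is most similar to.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old center if the cluster is empty
                centers[j] = total / norm  # re-project onto the unit sphere
    # Final assignment against the final centers.
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Two directions with very different magnitudes.
X = [[1, 0.1], [5, 0.2], [10, 0.5], [0.1, 1], [0.2, 5], [0.5, 10]]
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

On this toy input the first three rows should end up sharing one label and the last three the other, since only direction matters after normalisation. Which group gets label 0 depends on the random initialisation.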
It is not only limited to Gaussian spheres; it also assumes the clusters are isotropic (unless the data is preprocessed or the distance is defined as the Mahalanobis distance).
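
The two escape hatches mentioned here (preprocessing, or a Mahalanobis distance) are closely related. A rough sketch, assuming NumPy and a single global covariance for the whole data set (per-cluster covariances would be a different model): whitening the data and then using plain Euclidean distance gives the same numbers as the Mahalanobis distance on the raw data. The helper names `whiten` and `mahalanobis` are made up for this example.

```python
import numpy as np

def whiten(X):
    """Linearly transform X so its sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Cholesky factor L with cov = L @ L.T; solving L z = x whitens x.
    L = np.linalg.cholesky(cov)
    Z = np.linalg.solve(L, Xc.T).T
    return Z, cov

def mahalanobis(u, v, cov):
    """Mahalanobis distance between u and v under covariance cov."""
    d = u - v
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Synthetic, deliberately stretched and sheared data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, cov = whiten(X)

# Euclidean distance after whitening vs. Mahalanobis distance before it.
print(np.linalg.norm(Z[0] - Z[1]), mahalanobis(X[0], X[1], cov))
```

The two printed values should agree up to floating-point error, so running ordinary k-means on the whitened data is one way to drop the isotropy assumption for data that shares a common covariance.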
Can you explain this please?
The model underlying k-means is that all the data is distributed into k hyperspheres. In the simple 2D case, that means drawing k circles around your data points in an X/Y plot such that the within-group variance is minimized. This is bad because in the real world, data is typically grouped in an elliptical or irregular way.

There are some examples of this at https://stats.stackexchange.com/questions/133656/how-to-unde...
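
A small worked example of why elongated groups are a problem, using made-up data: two long parallel stripes, one unit apart. The quantity k-means minimizes is the within-cluster sum of squared distances to each cluster mean, so we can compare that score for the "obvious" grouping (one cluster per stripe) against a grouping that cuts both stripes in half. The function name `wcss` is just for illustration.

```python
def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster's mean."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts)
    return total

# Two horizontal stripes: 21 points each, at y = 0 and y = 1.
stripe_a = [(float(x), 0.0) for x in range(-10, 11)]
stripe_b = [(float(x), 1.0) for x in range(-10, 11)]
points = stripe_a + stripe_b

# Grouping 1: one cluster per stripe (what a human would pick).
by_stripe = [stripe_a, stripe_b]
# Grouping 2: cut across both stripes at x = 0.
left_right = [[p for p in points if p[0] < 0],
              [p for p in points if p[0] >= 0]]

print(wcss(by_stripe))   # 1540.0
print(wcss(left_right))  # 395.5
```

The left/right cut scores far lower, so under the k-means objective it is the better solution, even though it mixes the two stripes. The objective rewards compact, roughly round blobs rather than long thin groups.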

Thank you!