| HN Mirror



	by CuriouslyC 1554 days ago
	A bigger problem with k-means than choosing the number of clusters is that the model assumes spherical multivariate Gaussian clusters. That's a bad model for most non-toy data.

4 comments

thesz 1553 days ago

It is not necessarily the case.

For example, word2vec runs k-means clustering with a cosine similarity measure [1]. It works very, very well. The caveat is that not many of the optimized variants of k-means will work with that "distance".

[1] https://github.com/tmikolov/word2vec/blob/master/word2vec.c#...
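To make the cosine variant concrete, here is a minimal sketch of k-means where assignment uses cosine similarity instead of Euclidean distance (often called spherical k-means). It is not taken from word2vec.c; the function names, the fixed iteration count, and the caller-supplied initial centers are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_kmeans(points, init_centers, iters=10):
    """Hypothetical helper: k-means with cosine similarity as the measure."""
    pts = [normalize(p) for p in points]
    centers = [normalize(c) for c in init_centers]
    labels = [0] * len(pts)
    for _ in range(iters):
        # Assignment step: pick the center with the highest cosine similarity.
        labels = [
            max(range(len(centers)), key=lambda k: dot(p, centers[k]))
            for p in pts
        ]
        # Update step: average the members, then project back to unit length.
        for k in range(len(centers)):
            members = [p for p, lab in zip(pts, labels) if lab == k]
            if members:  # leave an empty cluster's center unchanged
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[k] = normalize(mean)
    return labels, centers

# Vectors pointing the same way end up together regardless of magnitude.
data = [[1.0, 0.1], [10.0, 1.0], [0.1, 1.0], [1.0, 10.0]]
labels, centers = cosine_kmeans(data, init_centers=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

Only direction matters here, so [1, 0.1] and [10, 1] are treated as the same point. That is also why speed-ups that rely on Euclidean geometry may not carry over unchanged.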


dorgo 1553 days ago

My feeling is that if for a given problem "cluster center" is a meaningful concept then k-means is the right tool. The concept of "distance" can be adjusted as well. I think any metric will do. But if you want clusters without a "center of mass" then you are faced with a whole other problem and need other tools.


Royi 1554 days ago

It's not only limited to Gaussian spheres, it's also isotropic (unless the data is preprocessed or the distance is defined as the Mahalanobis distance).
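The two escape hatches mentioned here are closely related: whitening the data with a covariance matrix and then using plain Euclidean distance gives the same result as using the Mahalanobis distance on the raw data. A small 2D sketch, assuming a known symmetric positive-definite covariance (the function names and the numbers are made up for illustration):

```python
import math

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance d^T * inverse(cov) * d for a 2x2 cov."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - y[0], x[1] - y[1]
    # inverse of [[a, b], [b, c]] is [[c, -b], [-b, a]] / det
    return (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det

def whiten(p, cov):
    """Preprocess p by solving L z = p, where cov = L L^T (Cholesky)."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    z1 = p[0] / l11
    z2 = (p[1] - l21 * z1) / l22
    return (z1, z2)

def euclidean_sq(x, y):
    return (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2

# Hypothetical covariance and points, chosen only to show the equivalence.
cov = [[4.0, 1.0], [1.0, 2.0]]
x, y = (3.0, 1.0), (0.0, 0.0)

print(mahalanobis_sq(x, y, cov))
print(euclidean_sq(whiten(x, cov), whiten(y, cov)))
```

Both lines print the same value up to floating-point rounding. Note that a single global covariance only helps when every cluster is stretched the same way; per-cluster shapes are where mixture models usually come in.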


jacquesm 1554 days ago

Can you explain this please?


CuriouslyC 1554 days ago

The model underlying k-means is that all the data is distributed into k hyperspheres. In the simple 2D case, that means drawing k circles around your data points in an X/Y plot such that the within-group variance is minimized. This is bad because in the real world, data is typically grouped in an elliptical or irregular way.

There are some examples of this at https://stats.stackexchange.com/questions/133656/how-to-unde...
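A tiny numeric illustration of the same point, on made-up data: two long, thin stripes that a person would obviously group by stripe. The k-means objective (within-cluster sum of squared distances to the centroid) scores a left/right cut through both stripes as better than the "true" grouping. The data and names below are assumptions for the sketch; it only evaluates the objective for two candidate partitions and does not run the full algorithm.

```python
def wcss(clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for pts in clusters:
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts)
    return total

# Hypothetical toy data: two horizontal stripes, 20 units long, 2 units apart.
stripe_a = [(float(x), 0.0) for x in range(-10, 11)]
stripe_b = [(float(x), 2.0) for x in range(-10, 11)]
points = stripe_a + stripe_b

# Partition 1: the grouping a person would pick (one cluster per stripe).
by_stripe = [stripe_a, stripe_b]
# Partition 2: cut both stripes in half, which gives two rounder blobs.
left_right = [
    [p for p in points if p[0] < 0],
    [p for p in points if p[0] >= 0],
]

print("objective, one cluster per stripe:", wcss(by_stripe))
print("objective, left/right split:", wcss(left_right))
```

The left/right split has the lower objective, so it is the one k-means considers better, even though it mixes the two stripes.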


jacquesm 1553 days ago

Thank you!
