| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by falkenb0t 1554 days ago

I found myself having to do some clustering a few months back and learned about mean-shift clustering [1]. I found it to be a nice clustering algorithm for when you don't know quantity of clusters ahead of time.

That said, its worth noting that there are algorithms out there for determining optimal number of clusters for k-means (though personally I found them to be costly and subject to overfitting) like using the silhouette coefficient or elbow method [2].

[1]: https://www.geeksforgeeks.org/ml-mean-shift-clustering/ [2]: https://towardsdatascience.com/k-means-clustering-how-it-wor...

3 comments

nerdponx 1554 days ago

Note also that specifically for one-dimensional data, there is a globally optimal solution to the k-means clustering problem. There is an R package that implements it using a C++ core implementation [1], and also a Python wrapper [2].

This implementation is also surprisingly fast, so you can use it to brute-force check many different numbers of clusters and check using silhouette distance. The advantage over traditional k-means is that you don't need to check multiple initializations for any given number of clusters, because the algorithm is deterministic and guaranteed to find the global optimum.

[1]: https://cran.r-project.org/package=Ckmeans.1d.dp

[2]: https://github.com/djdt/ckwrap

link

Royi 1554 days ago

Is there a paper or a post which describes the solution?

link

nalrey 1554 days ago

I just want to mention that these algorithms were known way back and it is sad that the authors of CRAN package do not acknowledge it in their docs:

1. O(NK^2) dynamic programming from 1965 by James Bruce (https://dspace.mit.edu/bitstream/handle/1721.1/4396/RLE-TR-4...)

2. O(KNlogN) dynamic + divide&conquer algorighm from 1989 from Xiaolin Wu and John Rokne (http://doi.acm.org/10.1145/75427.75472)

3. O(NK) algorithm from 1991 by Xiaolin Wu (http://dx.doi.org/10.1016/0196-6774(91)90039-2)

link

Royi 1552 days ago

Is there any implementation in Python / Julia / MATLAB?

link

nerdponx 1554 days ago

Yes, the R package was written by the authors of the paper, and the CRAN info page references the paper.

link

CuriouslyC 1554 days ago

A bigger problem with kmeans than figuring out the number of clusters is the fact that the model assumes multivariate gaussian spheres. That's a bad model for most non-toy data.

link

thesz 1553 days ago

It is not necessarily the case.

For example, word2vec uses k-means clustering using cosine similarity measure [1]. It works very, very well. The caveat is not many optimization variations of k-means will work with that "distance".

[1] https://github.com/tmikolov/word2vec/blob/master/word2vec.c#...

link

dorgo 1553 days ago

My feeling is that if for a given problem "cluster center" is a meaningful concept then k-means is the right tool. The concept of "distance" can be adjusted as well. I think any metric will do. But if you want clusters without a "center of mass" then you are faced with a whole other problem and need other tools.

link

Royi 1554 days ago

Not only limited to Gaussian Spheres but also being isotropic (Unless data is pre processed or distance is defined by Mahalanobis Distance).

link

jacquesm 1554 days ago

Can you explain this please?

link

CuriouslyC 1554 days ago

The model underlying k-means is that all the data is distributed into k hyperspheres. In the simple 2D case, that means drawing k circles around your data points in a X/Y plot such that the inter-group variance is minimized. This is bad because in the real world, data is typically grouped in an elliptical or irregular way.

There are some examples of this at https://stats.stackexchange.com/questions/133656/how-to-unde...

link

jacquesm 1553 days ago

Thank you!

link

civilized 1554 days ago

The elbow method exists in many fields. It essentially boils down to "hope that an obviously good choice exists"

link