Hacker News new | ask | show | jobs
by falkenb0t 1554 days ago
I found myself having to do some clustering a few months back and learned about mean-shift clustering [1]. I found it to be a nice clustering algorithm for when you don't know quantity of clusters ahead of time.

That said, its worth noting that there are algorithms out there for determining optimal number of clusters for k-means (though personally I found them to be costly and subject to overfitting) like using the silhouette coefficient or elbow method [2].

[1]: https://www.geeksforgeeks.org/ml-mean-shift-clustering/ [2]: https://towardsdatascience.com/k-means-clustering-how-it-wor...

3 comments

Note also that specifically for one-dimensional data, there is a globally optimal solution to the k-means clustering problem. There is an R package that implements it using a C++ core implementation [1], and also a Python wrapper [2].

This implementation is also surprisingly fast, so you can use it to brute-force check many different numbers of clusters and check using silhouette distance. The advantage over traditional k-means is that you don't need to check multiple initializations for any given number of clusters, because the algorithm is deterministic and guaranteed to find the global optimum.

[1]: https://cran.r-project.org/package=Ckmeans.1d.dp

[2]: https://github.com/djdt/ckwrap

Is there a paper or a post which describes the solution?
I just want to mention that these algorithms were known way back and it is sad that the authors of CRAN package do not acknowledge it in their docs:

1. O(NK^2) dynamic programming from 1965 by James Bruce (https://dspace.mit.edu/bitstream/handle/1721.1/4396/RLE-TR-4...)

2. O(KNlogN) dynamic + divide&conquer algorighm from 1989 from Xiaolin Wu and John Rokne (http://doi.acm.org/10.1145/75427.75472)

3. O(NK) algorithm from 1991 by Xiaolin Wu (http://dx.doi.org/10.1016/0196-6774(91)90039-2)

Is there any implementation in Python / Julia / MATLAB?
Yes, the R package was written by the authors of the paper, and the CRAN info page references the paper.
A bigger problem with kmeans than figuring out the number of clusters is the fact that the model assumes multivariate gaussian spheres. That's a bad model for most non-toy data.
It is not necessarily the case.

For example, word2vec uses k-means clustering using cosine similarity measure [1]. It works very, very well. The caveat is not many optimization variations of k-means will work with that "distance".

[1] https://github.com/tmikolov/word2vec/blob/master/word2vec.c#...

My feeling is that if for a given problem "cluster center" is a meaningful concept then k-means is the right tool. The concept of "distance" can be adjusted as well. I think any metric will do. But if you want clusters without a "center of mass" then you are faced with a whole other problem and need other tools.
Not only limited to Gaussian Spheres but also being isotropic (Unless data is pre processed or distance is defined by Mahalanobis Distance).
Can you explain this please?
The model underlying k-means is that all the data is distributed into k hyperspheres. In the simple 2D case, that means drawing k circles around your data points in a X/Y plot such that the inter-group variance is minimized. This is bad because in the real world, data is typically grouped in an elliptical or irregular way.

There are some examples of this at https://stats.stackexchange.com/questions/133656/how-to-unde...

Thank you!
The elbow method exists in many fields. It essentially boils down to "hope that an obviously good choice exists"