Hacker News new | ask | show | jobs
by nerdponx 1558 days ago
Note also that specifically for one-dimensional data, there is a globally optimal solution to the k-means clustering problem. There is an R package that implements it using a C++ core implementation [1], and also a Python wrapper [2].

This implementation is also surprisingly fast, so you can use it to brute-force check many different numbers of clusters and check using silhouette distance. The advantage over traditional k-means is that you don't need to check multiple initializations for any given number of clusters, because the algorithm is deterministic and guaranteed to find the global optimum.

[1]: https://cran.r-project.org/package=Ckmeans.1d.dp

[2]: https://github.com/djdt/ckwrap

1 comments

Is there a paper or a post which describes the solution?
I just want to mention that these algorithms were known way back and it is sad that the authors of CRAN package do not acknowledge it in their docs:

1. O(NK^2) dynamic programming from 1965 by James Bruce (https://dspace.mit.edu/bitstream/handle/1721.1/4396/RLE-TR-4...)

2. O(KNlogN) dynamic + divide&conquer algorighm from 1989 from Xiaolin Wu and John Rokne (http://doi.acm.org/10.1145/75427.75472)

3. O(NK) algorithm from 1991 by Xiaolin Wu (http://dx.doi.org/10.1016/0196-6774(91)90039-2)

Is there any implementation in Python / Julia / MATLAB?
Yes, the R package was written by the authors of the paper, and the CRAN info page references the paper.