|
|
|
|
|
by fnl
3615 days ago
|
|
Hi there, very nice work and thanks for sharing/open sourcing! My questions to you would be: 1. The main problem with k-means is it's scalability [1]. Could you comment on what k and n g you'd expect your implementation to compute? 2. (kd-, ...) Trees with blacklisting (Pelleg-Moore) are a more efficient algorithmic approach to k-means clustering than k-means++ and consorts (which you are showing in the benchmarks on your blog). Do you know how yinyang fares against a correct Pelleg-Moore implementation of k-means (as the yinyang algorithm's authors did not... :-( )? [1] http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf |
|
2. No, unfortunately I don't. Implementing a tree on GPU is a real pain and the performance will still be bad, so I didn't even consider them.
The common problem with advanced approaches is the memory overhead. Hi-end GPU has only 12 GB and you have to fit. E.g. Yinyang becomes unapplicable on 500000 samples, 10000 clusters and 480 dimensions (though only the product of samples and clusters matters much).