| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by fnl 3615 days ago

Hi there, very nice work and thanks for sharing/open sourcing!

My questions to you would be:

1. The main problem with k-means is it's scalability [1]. Could you comment on what k and n g you'd expect your implementation to compute?

2. (kd-, ...) Trees with blacklisting (Pelleg-Moore) are a more efficient algorithmic approach to k-means clustering than k-means++ and consorts (which you are showing in the benchmarks on your blog). Do you know how yinyang fares against a correct Pelleg-Moore implementation of k-means (as the yinyang algorithm's authors did not... :-( )?

[1] http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

1 comments

vmarkovtsev 3615 days ago

1. The number of samples must not exceed UINT32_MAX, that is, 4*10^9. The number of clusters must not exceed UINT32_MAX too. Number of dimensions must not exceed 12288 (GPU shared memory constraint). We successfully tested with 4M samples and 480 dimensions (the product is greater than UINT32_MAX) against potential overflows. Practically speaking, it will not take ages if the product of samples, clusters and dimensions does not exceed 10^14.

2. No, unfortunately I don't. Implementing a tree on GPU is a real pain and the performance will still be bad, so I didn't even consider them.

The common problem with advanced approaches is the memory overhead. Hi-end GPU has only 12 GB and you have to fit. E.g. Yinyang becomes unapplicable on 500000 samples, 10000 clusters and 480 dimensions (though only the product of samples and clusters matters much).

link

fnl 3615 days ago

Thanks for explaining! So I conclude for the data sizes you mention yinyang on a GPU is possibly the best approach, after which Pelleg-Moore on CPUs is (still) the goto solution. Or can you see a way for distributing this among graphics cards?

link

vmarkovtsev 3615 days ago

I am working on the multi-gpu branch at the moment, but the memory constraints will remain the same (optimizing for the speed at this time). I do see the way to distribute memory across cards though, and it will be the next step. So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to make, say first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on CPU.

link