by fnl 3614 days ago
Thanks for explaining! So I conclude that, for the data sizes you mention, Yinyang on a GPU is possibly the best approach, and beyond those sizes Pelleg-Moore on CPUs is (still) the go-to solution. Or can you see a way to distribute this across graphics cards?
1 comment

I am working on the multi-GPU branch at the moment, but the memory constraints will remain the same (I am optimizing for speed for now). I do see a way to distribute memory across cards though, and that will be the next step. So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to run, say, the first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on the CPU.
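
To make the two-stage idea concrete, here is a rough pure-Python sketch of the warm-start pattern. It does not call kmcuda or any Pelleg-Moore library: `lloyd` is a plain Lloyd loop standing in for both stages, and `two_stage_kmeans`, `warmup_iters` and `refine_iters` are made-up names for illustration. The only point is that stage 2 starts from the centroids stage 1 left behind instead of from a fresh initialization.

```python
import random

def assign(points, centroids):
    """For each point, return the index of its nearest centroid (squared L2)."""
    return [
        min(range(len(centroids)),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
        for pt in points
    ]

def lloyd(points, centroids, max_iters):
    """Run up to max_iters plain Lloyd iterations starting from `centroids`."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                # Mean of the member points, one coordinate at a time.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids

def two_stage_kmeans(points, k, warmup_iters=5, refine_iters=100, seed=0):
    """Hypothetical helper: a short warm-up run, then refinement from its output."""
    rng = random.Random(seed)
    init = rng.sample(points, k)
    # Stage 1: stand-in for a few fast Lloyd iterations on the GPU.
    half_baked = lloyd(points, init, warmup_iters)
    # Stage 2: stand-in for a CPU implementation that accepts initial centroids.
    return lloyd(points, half_baked, refine_iters)
```

In a real pipeline the second `lloyd` call would be replaced by whatever CPU implementation you use, provided it accepts user-supplied initial centroids.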