| HN Mirror

I am working on the multi-gpu branch at the moment, but the memory constraints will remain the same (optimizing for the speed at this time). I do see the way to distribute memory across cards though, and it will be the next step. So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to make, say first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on CPU.