Not just MPI over a network. We can compress floats, send them over NVLink or PCIe to another GPU in the same host, and decompress them, and that can be faster than sending the raw data between GPUs. That's the premise behind dietgpu: it's cheap compression with a modest ratio (output is roughly 0.6-0.9x of the original size), but it's extremely fast, 100s of GB/s of throughput, the idea being that you're racing something that is similarly fast. General floating point data can be anywhere from nearly incompressible to highly compressible; it really just depends on what is being passed around.

Interconnects are in general improving at a slower rate than compute on the CPU/GPU, and that gap can be exploited.
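
To make the "race" concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions, not how dietgpu works internally: the function names are made up, the codec is assumed to compress and decompress at the same rate, and the 50 GB/s link and 300 GB/s codec figures are illustrative placeholders rather than measurements.

```python
def raw_time(size_gb, link_gb_per_s):
    """Seconds to send the data uncompressed over the link."""
    return size_gb / link_gb_per_s


def compressed_time(size_gb, link_gb_per_s, codec_gb_per_s, ratio, pipelined=False):
    """Seconds to compress, send, and decompress.

    ratio is compressed size / original size (e.g. 0.6-0.9).
    Assumption: compression and decompression both run at codec_gb_per_s,
    measured against the uncompressed size.
    """
    compress = size_gb / codec_gb_per_s
    transfer = ratio * size_gb / link_gb_per_s
    decompress = size_gb / codec_gb_per_s
    if pipelined:
        # Idealized overlap of the three stages: the slowest one dominates.
        return max(compress, transfer, decompress)
    # No overlap: the stages run back to back.
    return compress + transfer + decompress


if __name__ == "__main__":
    size, link, codec = 1.0, 50.0, 300.0  # hypothetical numbers
    print(f"raw: {raw_time(size, link) * 1e3:.2f} ms")
    for ratio in (0.6, 0.7, 0.8, 0.9):
        serial = compressed_time(size, link, codec, ratio)
        overlap = compressed_time(size, link, codec, ratio, pipelined=True)
        print(f"ratio {ratio}: serial {serial * 1e3:.2f} ms, "
              f"pipelined {overlap * 1e3:.2f} ms")
```

Under these made-up numbers the margin is thin: whether compression wins depends on the ratio the data actually achieves and on how well the compress, transfer, and decompress stages overlap, which is why the codec has to be about as fast as the link it is racing.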