For training, yes, you need to share the parameters (i.e., weights and biases) between devices, and there are a huge number of them. But for inference, you don't need nearly that much bandwidth to run a model in a distributed manner.

According to the author of Exo https://blog.exolabs.net/day-1/:

> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).

I think that makes sense, because the activations are just the numbers coming out of the neural network (or of one part of it). Their size is not even in the same order of magnitude as the number of parameters.
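
To put rough numbers on it, here is a back-of-the-envelope sketch in Python. The 4 KB activation size is the upper bound from the Exo quote and the 3 billion parameters come from the model name; everything else (16-bit parameters, a 100 Mbit/s link, 1 ms of one-way latency, and the simple `latency + size / bandwidth` model) is my own assumption for illustration, not something measured.

```python
def transfer_seconds(payload_bytes, bandwidth_bits_per_s, latency_s):
    """Simple model: fixed link latency plus time to push the bytes through."""
    return latency_s + payload_bytes * 8 / bandwidth_bits_per_s


ACTIVATION_BYTES = 4 * 1024      # upper bound quoted for Llama 3.2 3B
PARAM_COUNT = 3_000_000_000      # roughly 3B parameters, from the model name
BYTES_PER_PARAM = 2              # assumption: 16-bit weights
BANDWIDTH = 100e6                # assumption: 100 Mbit/s link
LATENCY = 1e-3                   # assumption: 1 ms one-way latency

# Time to hand one activation from Shard A to Shard B.
activation_time = transfer_seconds(ACTIVATION_BYTES, BANDWIDTH, LATENCY)

# Time to ship the full set of parameters over the same link.
param_time = transfer_seconds(PARAM_COUNT * BYTES_PER_PARAM, BANDWIDTH, LATENCY)

# Fraction of the activation hand-off that is pure latency.
latency_share = LATENCY / activation_time

print(f"activation hand-off: {activation_time * 1e3:.2f} ms "
      f"({latency_share:.0%} of it is latency)")
print(f"full parameter transfer: {param_time:.0f} s")
```

Under these assumptions the activation hand-off is dominated by the latency term, while moving all the parameters takes minutes and is dominated by bandwidth. A faster link shrinks the second number a lot but barely changes the first.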