by thangngoc89
479 days ago
The bottleneck in distributed GPU training/inference is the inter-GPU connection speed. Within a single node it's doable because the GPUs are connected over PCIe 4.0. For a cluster, you need at least a 50Gbps connection between nodes, which is expensive if the whole point is to use cheap GPUs.

According to the author of Exo https://blog.exolabs.net/day-1/:
> When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).
I think that makes sense because the activations are just the numbers coming out of the whole neural network (or part of it). Compared to the number of parameters, they're not even the same order of magnitude.
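
To put rough numbers on the latency vs. bandwidth point, here's a back-of-the-envelope sketch. The 4KB payload is the upper bound from the Exo quote and 50Gbps is the figure from the parent comment; the 1Gbps link and the 0.5 ms one-way latency are my own assumptions for illustration, not measurements. It only models handing one activation from one shard to the next, not training traffic like gradient syncs.

```python
def transfer_time_s(payload_bytes: int, bandwidth_bps: float) -> float:
    """Time to push the payload through the link, ignoring latency."""
    return payload_bytes * 8 / bandwidth_bps


def hop_time_s(payload_bytes: int, bandwidth_bps: float, latency_s: float) -> float:
    """Total time for one shard-to-shard hand-off: latency plus transfer."""
    return latency_s + transfer_time_s(payload_bytes, bandwidth_bps)


# Upper bound on activation size quoted for Llama 3.2 3B.
ACTIVATION_BYTES = 4 * 1024

# ASSUMPTION: 0.5 ms one-way latency on both links, purely illustrative.
ASSUMED_LATENCY_S = 500e-6

# Link name -> bandwidth in bits per second.
LINKS = {
    "1 Gbps (assumed commodity link)": 1e9,
    "50 Gbps (figure from parent comment)": 50e9,
}

if __name__ == "__main__":
    for name, bandwidth_bps in LINKS.items():
        transfer = transfer_time_s(ACTIVATION_BYTES, bandwidth_bps)
        total = hop_time_s(ACTIVATION_BYTES, bandwidth_bps, ASSUMED_LATENCY_S)
        share = ASSUMED_LATENCY_S / total
        print(
            f"{name}: transfer {transfer * 1e6:.2f} us, "
            f"total {total * 1e6:.2f} us, latency share {share:.1%}"
        )
```

Under those assumptions the transfer itself takes tens of microseconds at most, so going from 1Gbps to 50Gbps barely changes the per-hop time. Shaving the latency is what actually helps.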