|
|
|
|
|
by alecco
504 days ago
|
|
They also did bandwidth scaling to handle work around the nerfed H800 interconnects. > efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths > The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. (I know some of those words) https://arxiv.org/html/2412.19437v1 |
|