by sighingnow
982 days ago
The PipeDream-2BW paper [1] and the ZeRO-Offload paper [2] both show that a 1-step delayed asynchronous gradient update doesn't hurt convergence (or perplexity), while improving training efficiency by a large margin because it fully utilizes the bubbles in pipeline parallelism. However, neither Megatron-LM [3] nor DeepSpeed [4] uses PipeDream-2BW scheduling.

Could anyone share some insights or ideas about why such an efficient scheduling scheme hasn't become popular in the LLM pretraining community? Does it suffer from convergence/accuracy issues in practice? Or are there other concerns blocking it from becoming the default / most popular pipeline parallelism schedule?

[1]: https://arxiv.org/abs/2006.09503
[2]: https://arxiv.org/abs/2101.06840
[3]: https://github.com/nvidia/Megatron-LM
[4]: https://github.com/microsoft/DeepSpeed/issues/1110
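
To make "1-step delayed update" concrete, here is a toy sketch of the update rule as I understand it: the gradient applied at step t is computed from the weights of step t-1 instead of the current ones. This is not the actual PipeDream-2BW or ZeRO-Offload implementation; the quadratic loss and the names `grad` and `train` are made up purely for illustration.

```python
def grad(w, target=3.0):
    """Gradient of the toy loss 0.5 * (w - target)**2 (an assumed example loss)."""
    return w - target


def train(steps, lr, delay):
    """Plain gradient descent where each gradient is computed on weights
    that are `delay` updates old (delay=0 is ordinary synchronous descent)."""
    w = 0.0
    history = [w]  # history[t] holds the weights after t updates
    for t in range(steps):
        # Use stale weights for the gradient; clamp at 0 during warm-up.
        stale_w = history[max(0, t - delay)]
        w = w - lr * grad(stale_w)
        history.append(w)
    return w


sync_w = train(steps=200, lr=0.1, delay=0)
delayed_w = train(steps=200, lr=0.1, delay=1)
print(f"synchronous: {sync_w:.6f}, 1-step delayed: {delayed_w:.6f}")
```

On this toy problem both variants end up at the same optimum, which of course says nothing about whether the same holds at LLM scale; that is exactly what I'm asking about.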