by BadInformatics
1921 days ago
Having followed this DeepSpeed stuff for a little while, I'd say the ZeRO paper is probably as close as you can get to an ELI5, because there's no single brilliant idea behind this. Most of the ideas have been explored already (see e.g. the PyTorch DDP paper [1]), but ZeRO takes them to their logical conclusion by throwing a TON of engineering work at the problem. For example, they implement custom fused kernels on CPU/GPU and a hand-vectorized Adam implementation. I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.

[1] https://arxiv.org/abs/2006.15704
[2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr...
[3] https://arxiv.org/abs/2101.06840