Hacker News
by BadInformatics 1921 days ago
Having followed this DeepSpeed stuff for a little while, the ZeRO paper is probably as close as you can get to an ELI5 because there's no single brilliant idea behind this. Most of the ideas have been explored already (see e.g. the PyTorch DDP paper [1]), but ZeRO takes them to their logical conclusion by throwing a TON of engineering work at the problem. For example, they implement custom fused kernels on CPU/GPU and a hand-vectorized Adam implementation.

I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.

[1] https://arxiv.org/abs/2006.15704
[2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr...
[3] https://arxiv.org/abs/2101.06840


My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc.) are orthogonal to the partitioning stuff. That seems to imply that ZeRO is model+pipeline parallel plus a bunch of miscellaneous bits. But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.

> My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc.) are orthogonal to the partitioning stuff

This is true. I included them as examples of the amount of engineering work involved because using the partitioning as an example would require recapitulating their blog post :)

> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.

Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.
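
To make the "model state sharding" part concrete, here is a toy single-process sketch (not DeepSpeed code) of data parallelism where each rank keeps only its slice of the optimizer state. Everything in it is an assumption made for illustration: the names (`plain_dp_step`, `sharded_dp_step`, `shard_range`) are hypothetical, Python lists stand in for per-rank memory, momentum SGD stands in for Adam, and the reduce-scatter and all-gather are simulated with loops. It only covers the optimizer-state case, not gradient or parameter partitioning.

```python
# Toy simulation: plain data parallel vs. data parallel with the
# optimizer state (momentum buffer) partitioned across ranks.
# No real communication happens; loops stand in for collectives.

WORLD = 2             # number of simulated ranks (assumption)
DIM = 4               # number of model parameters (assumption)
SHARD = DIM // WORLD  # parameters owned by each rank


def grad(params, batch):
    """Gradient of mean squared error for the linear model y = w . x."""
    g = [0.0] * len(params)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(params, x)) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(batch)
    return g


def momentum_step(params, grads, velocity, lr=0.05, beta=0.9):
    """SGD with momentum, updating params and velocity in place."""
    for i in range(len(params)):
        velocity[i] = beta * velocity[i] + grads[i]
        params[i] -= lr * velocity[i]


def shard_range(rank):
    """Indices of the parameters whose optimizer state lives on `rank`."""
    return range(rank * SHARD, (rank + 1) * SHARD)


def plain_dp_step(params, velocity, batches):
    """Baseline: every rank would hold the full velocity buffer."""
    local = [grad(params, b) for b in batches]
    # Simulated all-reduce: every rank ends up with the averaged gradient.
    avg = [sum(g[i] for g in local) / WORLD for i in range(DIM)]
    momentum_step(params, avg, velocity)


def sharded_dp_step(replicas, velocity_shards, batches):
    """Each rank holds full params but only SHARD entries of velocity."""
    local = [grad(replicas[r], batches[r]) for r in range(WORLD)]
    new_shards = []
    for r in range(WORLD):
        idx = shard_range(r)
        # Simulated reduce-scatter: rank r gets averaged grads for its slice.
        g_shard = [sum(g[i] for g in local) / WORLD for i in idx]
        p_shard = [replicas[r][i] for i in idx]
        momentum_step(p_shard, g_shard, velocity_shards[r])
        new_shards.append(p_shard)
    # Simulated all-gather: every rank receives all updated slices.
    full = [p for shard in new_shards for p in shard]
    for r in range(WORLD):
        replicas[r][:] = full


# One mini-batch per rank, as in ordinary data parallel training.
batches = [
    [([1.0, 0.0, 2.0, 1.0], 3.0), ([0.0, 1.0, 1.0, 0.0], 1.0)],
    [([2.0, 1.0, 0.0, 1.0], 2.0), ([1.0, 1.0, 1.0, 1.0], 4.0)],
]

baseline_params = [0.0] * DIM
baseline_velocity = [0.0] * DIM

replicas = [[0.0] * DIM for _ in range(WORLD)]
velocity_shards = [[0.0] * SHARD for _ in range(WORLD)]

for _ in range(3):
    plain_dp_step(baseline_params, baseline_velocity, batches)
    sharded_dp_step(replicas, velocity_shards, batches)

max_diff = max(abs(a - b) for a, b in zip(baseline_params, replicas[0]))
print("max difference vs. plain data parallel:", max_diff)
```

The point of the sketch is that the sharded version is still data parallel: every rank runs the whole model on its own batch, and the update matches the unsharded baseline. The only change is who stores which slice of the optimizer state, at the cost of an all-gather after the update.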