by leogao
1919 days ago
I have a reasonable amount of experience with distributed machine learning (and transformers in particular), and I have to 100% agree that this blog post (and even the ZeRO paper) is largely technobabble. I don't doubt that this might really work, but how it works is not explained very well, and I'm still not 100% sure I understand what they actually did.

For anyone who still thinks the blog post has substance: saying that they partitioned the optimizer state, params, etc. to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of not having redundancies before.

The problem is that, in general, distributed training is a weird balancing act between redundancy, network usage, compute, etc. The existing methods (model/pipeline/data parallelism, gradient checkpointing/accumulation, etc.) all have their pros and cons. Unless ZeRO-3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.

If someone could ELI5 how ZeRO actually works, that would be nice.

I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.
[1] https://arxiv.org/abs/2006.15704
[2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr...
[3] https://arxiv.org/abs/2101.06840
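On the "what is it giving up" question, here is a rough sketch of the per-GPU memory arithmetic as I understand it from the ZeRO paper, assuming mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for the fp32 optimizer state). The function names are made up for illustration and are not part of DeepSpeed; activations and buffers are ignored. The paper's stated cost is that stages 1 and 2 keep the same communication volume as plain data parallelism, while stage 3 moves roughly 1.5x as much.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage, k=12):
    """Approximate model-state memory per GPU in GB (1e9 bytes).

    stage 0 = plain data parallel, 1 = shard optimizer state,
    2 = also shard gradients, 3 = also shard parameters.
    k is the optimizer-state bytes per parameter (12 for mixed-precision Adam).
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # fp32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + optim) / 1e9


def zero_comm_volume_in_params(stage):
    """Elements moved per GPU per step, as a multiple of the parameter count."""
    # Stages 0-2: reduce-scatter + all-gather, about 2x the parameter count.
    # Stage 3: an extra parameter all-gather, about 3x the parameter count.
    return 3 if stage >= 3 else 2


if __name__ == "__main__":
    for stage in range(4):
        mem = zero_memory_per_gpu_gb(7.5e9, 64, stage)
        comm = zero_comm_volume_in_params(stage)
        print(f"stage {stage}: {mem:.1f} GB per GPU, comm ~{comm}x params")
```

For a 7.5B-parameter model on 64 GPUs this works out to about 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per GPU for stages 0 through 3, which matches the example figures in the paper.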