by jiofih 1919 days ago
Third paragraph or so in the overview:

> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency


Yeah that would be the techno-babble. I've been working on a machine learning pipeline for 6 years and I still have no idea what this means.
It is mostly applicable to transformer models; the ideas in the paper would be alien if you work on computer vision.

In transformer models, a big chunk of memory goes to the parameters and the optimizer states (because vanilla SGD is not used there). A memory optimization technique that removes the duplication of parameters on each GPU, or offloads them entirely to the CPU, makes sense.
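
To put rough numbers on that, here is a minimal sketch of the per-GPU memory for model states, using the accounting from the ZeRO paper: mixed-precision Adam with 2 bytes per fp16 parameter, 2 per fp16 gradient, and 12 for the fp32 optimizer state (master weights, momentum, variance). The function name and the 7.5B-parameter, 64-GPU example are only illustrative, and activations and temporary buffers are ignored.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU memory (GB, 1e9 bytes) for parameters, gradients and Adam state.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state partitioned across GPUs
    stage 2: gradients partitioned as well
    stage 3: parameters partitioned as well
    """
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(model_state_gb(7.5e9, 64, stage), 1))
```

With everything replicated the optimizer state dominates, which is why partitioning it first already buys most of the savings.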

In computer vision, a big chunk of memory is held by the forward layer activations, and the memory optimization technique applicable in those cases would be binomial checkpointing.
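
For reference, a toy sketch of activation checkpointing on a chain of scalar layers. Note that it uses evenly spaced checkpoints, not the binomial schedule, which chooses checkpoint positions more cleverly. The layer function, the weights and all names are made up for illustration.

```python
import math


def layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """dy/dx of the toy layer at input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(weights, x0):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, x_in)
    return g, len(inputs)


def grad_checkpointed(weights, x0, segment):
    """Keep only every `segment`-th layer input; recompute the rest on the way back."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g = 1.0
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute the inputs of this segment from its checkpoint.
        inputs = []
        x = checkpoints[start]
        for w in weights[start:end]:
            inputs.append(x)
            x = layer(w, x)
        # Simple upper bound: all checkpoints plus the current segment's inputs.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(weights[start:end]), reversed(inputs)):
            g *= layer_grad(w, x_in)
    return g, peak


weights = [0.5 + 0.01 * i for i in range(16)]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(stored_full, stored_ckpt, abs(g_full - g_ckpt))
```

The gradient is the same either way; the checkpointed version trades one extra forward pass per segment for holding fewer activations at once.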

It doesn't sound like techno-babble to me. They've distributed storage across nodes rather than replicating it on each node, hence the model size now scales with the number of nodes rather than being limited to what can be stored on a single node.
But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B model on a single V100.
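
A toy single-process simulation of that idea, assuming SGD with momentum instead of Adam and Python lists instead of GPUs: each "rank" keeps the momentum only for its own slice of the parameters, updates that slice, and the slices are then gathered back together. All names here are made up; this is not DeepSpeed's API.

```python
def shard_bounds(n, world_size, rank):
    """Half-open index range of the slice owned by `rank`."""
    per = (n + world_size - 1) // world_size
    return min(rank * per, n), min((rank + 1) * per, n)


def replicated_step(params, momentum, local_grads, lr=0.1, beta=0.9):
    """Classic data parallelism: every rank holds the full momentum vector."""
    world = len(local_grads)
    # Stand-in for an all-reduce: average each gradient entry over ranks.
    avg = [sum(g[i] for g in local_grads) / world for i in range(len(params))]
    new_m = [beta * m + g for m, g in zip(momentum, avg)]
    new_p = [p - lr * m for p, m in zip(params, new_m)]
    return new_p, new_m


def sharded_step(params, momentum_shards, local_grads, lr=0.1, beta=0.9):
    """Partitioned variant: rank r holds momentum only for its own slice."""
    world = len(local_grads)
    new_params = []
    for rank in range(world):
        lo, hi = shard_bounds(len(params), world, rank)
        # Stand-in for a reduce-scatter: rank r gets the averaged gradient
        # for its slice only.
        avg = [sum(g[i] for g in local_grads) / world for i in range(lo, hi)]
        m = momentum_shards[rank]
        for j in range(hi - lo):
            m[j] = beta * m[j] + avg[j]
        # Update the owned slice; extending the list stands in for an all-gather.
        new_params.extend(params[lo + j] - lr * m[j] for j in range(hi - lo))
    return new_params


params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [[0.1 * (r + 1) * (i + 1) for i in range(5)] for r in range(2)]
world = len(grads)

full_m = [0.0] * len(params)
shards = []
for r in range(world):
    lo, hi = shard_bounds(len(params), world, r)
    shards.append([0.0] * (hi - lo))

p_rep, p_shard = list(params), list(params)
for _ in range(3):
    p_rep, full_m = replicated_step(p_rep, full_m, grads)
    p_shard = sharded_step(p_shard, shards, grads)

print(p_rep)
print(p_shard)
```

Both versions end up with the same parameters, but in the partitioned one no rank ever stores more than its own slice of the optimizer state.
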
They offload parameters, gradients and optimizer states (such as moment, velocity and exponential avg of these in Adam) into CPU memory.
They did all that before: https://arxiv.org/abs/2101.06840, but they could only fit a model with 13B weights on a single V100.
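
Some back-of-the-envelope arithmetic on why offloading the parameters themselves makes the difference, assuming 2 bytes per fp16 parameter and the 32 GB variant of the V100 (neither figure is stated above):

```python
V100_GB = 32  # assumption: the 32 GB V100 variant


def fp16_param_gb(num_params):
    """Memory for the fp16 weights alone, in GB (1e9 bytes)."""
    return 2 * num_params / 1e9


for billions in (13, 40):
    need = fp16_param_gb(billions * 1e9)
    fits = need <= V100_GB
    print(f"{billions}B params: {need:.0f} GB of fp16 weights, fits on GPU: {fits}")
```

Under those assumptions, 13B parameters' worth of fp16 weights can stay on the card as long as gradients and optimizer states live in CPU memory, while 40B only works if most of the weights are also kept off the GPU and brought in as needed.
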
I have a reasonable amount of experience with distributed machine learning (and transformers in particular, too) and I have to 100% agree that this blog post (and even the ZeRO paper) is largely technobabble. I don't doubt that this might really work, but how it works is not elucidated very well, and I'm still not 100% sure I understand what they actually did.

For anyone who still thinks the blog post has substance: saying that they partitioned the optimizer state, params, etc. to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of not having redundancies before. The problem is that, in general, distributed training is a weird balancing act between redundancy, network usage, compute, etc. The existing methods (model/pipeline/data parallel, gradient checkpoint/accum, etc.) all have their pros and cons. Unless ZeRO3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.

If someone could ELI5 how ZeRO actually works, that would be nice.

Having followed this DeepSpeed stuff for a little while, the ZeRO paper is probably as close as you can get to an ELI5 because there's no singular brilliant idea behind this. Most of the ideas have been explored already (see e.g. the PyTorch DDP paper [1]), but ZeRO takes them to their logical conclusion by throwing a TON of engineering work into the equation. For example, they implement custom fused kernels on CPU/GPU and a hand-vectorized Adam implementation.

I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.

[1] https://arxiv.org/abs/2006.15704

[2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr...

[3] https://arxiv.org/abs/2101.06840

My impression from reading the paper is that most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc.) are orthogonal to the partitioning stuff. That seems to imply that ZeRO is model+pipeline parallel plus a bunch of miscellaneous bits. But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.
> My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc) are orthogonal to the partitioning stuff

This is true; I included them as examples of the amount of engineering work involved because using the partitioning as an example would require recapitulating their blog post :)

> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.

Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.
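
Since bucketing comes up: a minimal sketch of the idea, assuming we only know the element count of each gradient tensor. Small tensors are packed into fixed-capacity buckets so that one communication call can cover many of them. The function name and the bucket capacity are hypothetical, and real libraries will differ in the details.

```python
def make_buckets(tensor_sizes, capacity):
    """Greedily group tensor indices so each bucket holds at most `capacity` elements.

    A tensor larger than `capacity` ends up in a bucket of its own.
    """
    buckets, current, used = [], [], 0
    for idx, size in enumerate(tensor_sizes):
        # Close the current bucket if this tensor would overflow it.
        if current and used + size > capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        buckets.append(current)
    return buckets


sizes = [10, 20, 5, 100, 30, 30]
buckets = make_buckets(sizes, capacity=64)
print(buckets)
print(len(sizes), "tensors ->", len(buckets), "communication calls")
```

Fewer, larger messages amortize the per-call latency, and a bucket can be sent as soon as it is full, which is what lets communication overlap with the rest of the backward pass.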

You can read the paper here: https://arxiv.org/abs/1910.02054
If your pipeline uses only “classic” ML models, then this won’t make too much sense. It’s mostly applicable to NNs.
The product is obviously not for you but for clueless PHBs who want the "latest and best" for the team so those useless ML engineers can finally put his brilliant idea in production with a less than 1% prediction error.