by zachthewf 1919 days ago
It doesn't sound like techno-babble to me. They've partitioned the model state across nodes rather than replicating it on each node, so the maximum model size now scales with the number of nodes instead of being limited to what fits on a single node.
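To make the scaling argument concrete, here is a rough sketch of the capacity arithmetic. The function name, the 32 GB per-node figure and the 16 bytes of state per parameter are my own assumptions for illustration, not numbers from the article, and activations and buffers are ignored.

```python
def max_params(node_capacity_bytes: int, bytes_per_param: int,
               num_nodes: int, partitioned: bool) -> int:
    """Largest parameter count whose training state fits in memory.

    Hypothetical helper: ignores activations, buffers and fragmentation.
    """
    if num_nodes < 1 or bytes_per_param < 1:
        raise ValueError("num_nodes and bytes_per_param must be positive")
    # Replicated: every node holds a full copy, so one node's capacity is the cap.
    # Partitioned: each node holds 1/num_nodes of the state, so capacities add up.
    usable = node_capacity_bytes * num_nodes if partitioned else node_capacity_bytes
    return usable // bytes_per_param


if __name__ == "__main__":
    capacity = 32 * 10**9   # assumed 32 GB of memory per node
    per_param = 16          # assumed bytes of training state per parameter
    for nodes in (1, 8, 64):
        replicated = max_params(capacity, per_param, nodes, partitioned=False)
        sharded = max_params(capacity, per_param, nodes, partitioned=True)
        print(f"{nodes:>2} nodes: replicated {replicated:.1e}, partitioned {sharded:.1e}")
```

With replication the limit stays flat no matter how many nodes you add; with partitioning it grows linearly with the node count.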

But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B model on a single V100.
They offload parameters, gradients and optimizer states (such as Adam's momentum and variance estimates, i.e. the exponential moving averages of the gradient and of the squared gradient) into CPU memory.
They did all that before (https://arxiv.org/abs/2101.06840), but back then they could only fit a model with 13B parameters on a single V100.
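For a sense of scale, here is a back-of-the-envelope tally of the state involved. The byte counts assume mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum and variance), and the 32 GB figure assumes the larger V100 variant; both are my assumptions, not numbers from the article, and activations are ignored.

```python
GB = 10**9
V100_BYTES = 32 * GB  # assumption: 32 GB V100 variant

# Assumed bytes of training state per parameter under mixed-precision Adam.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_momentum": 4,
    "adam_variance": 4,
}


def state_bytes(num_params: int, components=None) -> int:
    """Total bytes for the chosen state components (all of them by default)."""
    names = BYTES_PER_PARAM.keys() if components is None else components
    return num_params * sum(BYTES_PER_PARAM[name] for name in names)


if __name__ == "__main__":
    for billions in (13, 40):
        n = billions * 10**9
        total = state_bytes(n)
        weights_only = state_bytes(n, ["fp16_weights"])
        print(f"{billions}B: full state {total // GB} GB, "
              f"fp16 weights alone {weights_only // GB} GB, "
              f"weights fit on GPU: {weights_only <= V100_BYTES}")
```

Under these assumptions the full state is far larger than GPU memory for both sizes, so most of it has to live off-GPU either way. The difference is that the fp16 weights of a 13B model (26 GB) could still sit on a 32 GB card, while those of a 40B model (80 GB) could not.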