by p1esk 1919 days ago
But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B-parameter model on a single V100.
1 comment

They offload parameters, gradients and optimizer states (in Adam, the exponential moving averages of the gradient and of the squared gradient, i.e. momentum and variance) into CPU memory.
They did all that before: https://arxiv.org/abs/2101.06840, but they could only fit a model with 13B weights on a single V100.
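
For a rough sense of why offloading is needed at all, here is a back-of-the-envelope sketch. It assumes the usual mixed-precision Adam accounting of 16 bytes per parameter and a 32 GB V100 (there is also a 16 GB variant). The function name and constants are mine, not from the paper, and activations and buffers are ignored.

```python
# Rough model-state memory for mixed-precision Adam training.
# Assumption: 16 bytes per parameter, split as below. Activations,
# temporary buffers and fragmentation are NOT counted.
BYTES_PER_PARAM = {
    "fp16 parameters": 2,
    "fp16 gradients": 2,
    "fp32 master parameters": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

V100_MEMORY_GB = 32  # assumption: the larger V100 variant


def model_state_gb(num_params: float) -> float:
    """Return model-state memory in GB (10^9 bytes) for num_params parameters."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


if __name__ == "__main__":
    for billions in (13, 40):
        gb = model_state_gb(billions * 1e9)
        ratio = gb / V100_MEMORY_GB
        print(f"{billions}B params: ~{gb:.0f} GB of model states, "
              f"~{ratio:.1f}x a {V100_MEMORY_GB} GB V100")
```

Under these assumptions neither the 13B nor the 40B model's states come anywhere close to fitting in GPU memory, so most of them have to live somewhere else, such as CPU memory.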