by liuliu 1919 days ago
They offload parameters, gradients and optimizer states (in Adam, the exponential moving averages of the gradient and of the squared gradient, i.e. momentum and variance) into CPU memory.

They did all of that before (ZeRO-Offload, https://arxiv.org/abs/2101.06840), but back then they could only fit a model with 13B parameters on a single V100.
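
For a rough sense of why the offloading matters, here is a back-of-the-envelope sketch. It assumes the usual mixed-precision Adam layout (fp16 parameters and gradients, plus fp32 master parameters, momentum and variance); that layout and the function names are my assumptions, not something stated above, and activations are ignored entirely.

```python
# Rough per-parameter memory for mixed-precision Adam training.
# Assumed layout (not taken from the thread): fp16 weights and grads,
# fp32 master weights, and two fp32 Adam moment buffers.
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_master_params": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_state_gb(num_params: int) -> dict:
    """Size of each piece of training state in GB (1 GB = 1e9 bytes)."""
    return {
        name: num_params * nbytes / 1e9
        for name, nbytes in BYTES_PER_PARAM.items()
    }


def total_gb(num_params: int) -> float:
    """Total training state in GB, ignoring activations and buffers."""
    return sum(training_state_gb(num_params).values())


if __name__ == "__main__":
    n = 13_000_000_000  # the 13B model mentioned above
    for name, gb in training_state_gb(n).items():
        print(f"{name:>20}: {gb:6.1f} GB")
    print(f"{'total':>20}: {total_gb(n):6.1f} GB")
```

Under those assumptions a 13B model carries about 208 GB of training state, several times what any single V100 holds, so most of it has to live somewhere other than GPU memory.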