by
p1esk
1919 days ago
But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B model on a single V100.
liuliu
1919 days ago
They offload parameters, gradients, and optimizer states (such as the momentum and variance terms, i.e. the exponential moving averages that Adam keeps) into CPU memory.
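For a rough sense of scale, here is a back-of-the-envelope sketch of where the model states could live. It assumes the commonly cited mixed-precision Adam breakdown of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus momentum and variance) and ignores activations and buffers. The function name and flags are made up for illustration and are not any library's API.

```python
# Assumed per-parameter costs for mixed-precision Adam (illustrative only).
BYTES_FP16_PARAMS = 2
BYTES_FP16_GRADS = 2
BYTES_OPTIMIZER = 12  # fp32 master weights + momentum + variance, 4 bytes each


def model_state_gb(num_params, offload_params=False, offload_grads=False,
                   offload_optimizer=False):
    """Return (gpu_gb, cpu_gb) for model states only; activations are ignored."""
    gpu_bytes = 0
    cpu_bytes = 0
    for per_param, offloaded in (
        (BYTES_FP16_PARAMS, offload_params),
        (BYTES_FP16_GRADS, offload_grads),
        (BYTES_OPTIMIZER, offload_optimizer),
    ):
        if offloaded:
            cpu_bytes += per_param * num_params
        else:
            gpu_bytes += per_param * num_params
    return gpu_bytes / 1e9, cpu_bytes / 1e9


if __name__ == "__main__":
    for n in (13_000_000_000, 40_000_000_000):
        # Offloading gradients and optimizer states only.
        print(n, model_state_gb(n, offload_grads=True, offload_optimizer=True))
        # Offloading parameters as well.
        print(n, model_state_gb(n, offload_params=True, offload_grads=True,
                                offload_optimizer=True))
```

Under these assumptions, keeping only the fp16 weights on the GPU costs 2 bytes per parameter, so whether a given model fits depends mostly on whether the weights themselves are also moved off the GPU. Real numbers will differ once activations and working buffers are counted.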
p1esk
1919 days ago
They did all that before (https://arxiv.org/abs/2101.06840), but they could only fit a model with 13B weights on a single V100.