Hacker News
by
liuliu
1919 days ago
They offload parameters, gradients, and optimizer states (for Adam, the momentum and variance, i.e. the exponential moving averages of the gradient and the squared gradient) to CPU memory.
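To get a rough feel for why offloading these tensors matters, here is a minimal sketch of the memory accounting. It assumes the usual mixed-precision Adam layout (fp16 parameters and gradients, plus fp32 master parameters, momentum, and variance, 16 bytes per parameter in total). That layout, the `BYTES_PER_PARAM` table, and the `memory_gb` helper are my assumptions for illustration, not something stated in the thread, and activations and buffers are ignored.

```python
# Assumed bytes per parameter under mixed-precision Adam (illustrative accounting).
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_master_params": 4,
    "fp32_momentum": 4,   # Adam first moment
    "fp32_variance": 4,   # Adam second moment
}


def memory_gb(num_params, offloaded=()):
    """Return (gpu_gb, cpu_gb) for model states, given the tensor groups kept on CPU.

    `offloaded` is a collection of keys from BYTES_PER_PARAM (hypothetical names).
    Uses decimal gigabytes (1 GB = 1e9 bytes).
    """
    gpu_bytes = 0
    cpu_bytes = 0
    for name, nbytes in BYTES_PER_PARAM.items():
        if name in offloaded:
            cpu_bytes += nbytes * num_params
        else:
            gpu_bytes += nbytes * num_params
    return gpu_bytes / 1e9, cpu_bytes / 1e9


if __name__ == "__main__":
    n = 1_000_000_000  # a 1B-parameter model, chosen only as an example
    print(memory_gb(n))                                   # everything on the GPU
    print(memory_gb(n, offloaded=set(BYTES_PER_PARAM)))   # everything offloaded
```

Under these assumptions, a 1B-parameter model needs about 16 GB for model states alone, so moving some or all of those groups to CPU memory is what frees up the GPU.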
1 comment
p1esk
1919 days ago
They did all that before (https://arxiv.org/abs/2101.06840), but they could only fit a model with 13B weights on a single V100.