


	by p1esk 1966 days ago
	But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B-parameter model on a single V100.

1 comment

They offload parameters, gradients and optimizer states (in Adam, the exponential moving averages of the gradient and of its square) into CPU memory.

They did all of that before (https://arxiv.org/abs/2101.06840), but back then they could only fit a 13B-parameter model on a single V100.
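
To see why offloading is needed at all, here is a rough back-of-the-envelope sketch. It assumes the usual mixed-precision Adam accounting of about 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 master weights and two fp32 moment estimates) and the 32 GB variant of the V100. Neither figure comes from this thread, and activations are ignored, so treat the numbers as an order-of-magnitude estimate only.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# Assumptions (not stated in the thread): fp16 weights and gradients,
# fp32 master weights plus two fp32 Adam moment estimates, 32 GB V100.
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_FP32_OPTIMIZER = 4 + 4 + 4  # master weights, first moment, second moment

V100_MEMORY_GB = 32  # assumed card size; a 16 GB variant also exists


def training_state_gb(num_params: float) -> float:
    """Return the approximate model-state size in GB, excluding activations."""
    bytes_per_param = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_FP32_OPTIMIZER
    return num_params * bytes_per_param / 1e9


for label, num_params in (("13B", 13e9), ("40B", 40e9)):
    total_gb = training_state_gb(num_params)
    print(f"{label}: ~{total_gb:.0f} GB of model state vs {V100_MEMORY_GB} GB on the GPU")
```

Under these assumptions even the 13B model's state is several times larger than the GPU's memory, which is why most of it has to live somewhere else.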