


	by ansk 1924 days ago
	Question for someone knowledgeable about this: if I have a model which is large -- but small enough that I can fit a single training example on the GPU -- does this approach offer speedups compared to simple gradient accumulation? Or is it only useful for models so large that the model parameters themselves overwhelm GPU memory?
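
For readers unfamiliar with the baseline in the question, here is a minimal sketch of what "simple gradient accumulation" usually means: process one example at a time, sum the scaled gradients, and apply a single optimizer step at the end. The one-parameter model `y = w * x`, the squared-error loss, and the function names are all assumptions made up for illustration; none of them come from the thread or from any particular library.

```python
# Toy illustration only: a one-parameter model y = w * x with squared-error loss.
# All names here are hypothetical and chosen for this sketch.

def grad_sq_error(w, x, y):
    """Gradient of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def full_batch_step(w, xs, ys, lr):
    """One SGD step using the mean gradient over the whole batch at once."""
    g = sum(grad_sq_error(w, x, y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g


def accumulated_step(w, xs, ys, lr):
    """Same update, but only one example is 'in memory' at a time."""
    acc = 0.0
    n = len(xs)
    for x, y in zip(xs, ys):
        # Scale each micro-batch gradient so the running sum ends up as the mean.
        acc += grad_sq_error(w, x, y) / n
    # Single parameter update after all micro-batches have been processed.
    return w - lr * acc


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_full = full_batch_step(0.0, xs, ys, lr=0.01)
w_acc = accumulated_step(0.0, xs, ys, lr=0.01)
```

Both functions produce the same update, which is the point of accumulation: it trades wall-clock time for activation memory, and does nothing to reduce the memory taken by the parameters themselves.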