| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by yaroslavvb 2750 days ago
	This kind of scaling was rigorously shown for a related metric called "gradient diversity" in https://arxiv.org/abs/1706.05699

1 comments

samsamoa 2750 days ago

One of the authors here. Thanks for the comment! Yes, we mention this work and number of others in the blogpost and paper. This isn't the first (or the last) paper on the topic but I think we've clarified the large-batch training situation significantly by connecting gradient noise directly to the speed of training, and by measuring it systematically on a bunch of ML tasks and characterizing its behavior.

link

TTPrograms 2750 days ago

Similarly I think the theory developed in the Three Factors paper predicts this scaling law, might be worth citing: https://arxiv.org/abs/1711.04623

link