
by yobbo 507 days ago

For example, if large batch sizes are practical, the gradient variance will be lower, and Adam could end up behaving much like plain momentum.
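To make the batch-size point concrete, here is a rough sketch on a toy 1-D problem. It assumes each per-example gradient is a fixed "true" gradient plus independent Gaussian noise, which is an illustrative assumption and not something real models guarantee. The function name and parameters are made up for the example. Under that assumption the variance of the mini-batch mean should shrink roughly as 1/batch_size.

```python
import random
import statistics


def minibatch_gradient_variance(batch_size, n_batches=2000,
                                noise_std=1.0, true_grad=0.5, seed=0):
    """Empirical variance of the mini-batch mean gradient (toy model).

    Assumption: per-example gradient = true_grad + N(0, noise_std^2),
    with independent noise across examples.
    """
    rng = random.Random(seed)
    batch_means = []
    for _ in range(n_batches):
        # Draw one mini-batch of noisy per-example gradients.
        grads = [true_grad + rng.gauss(0.0, noise_std)
                 for _ in range(batch_size)]
        batch_means.append(sum(grads) / batch_size)
    # Variance of the averaged gradient across many mini-batches.
    return statistics.pvariance(batch_means)


var_small = minibatch_gradient_variance(batch_size=4)
var_large = minibatch_gradient_variance(batch_size=64)
print(f"batch=4:  variance ~ {var_small:.4f}")
print(f"batch=64: variance ~ {var_large:.4f}")
```

With per-example noise of unit variance, the two numbers should land near 1/4 and 1/64. Real gradients are not independent Gaussians, so treat this only as intuition for why larger batches leave less for the optimizer to smooth out.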

As a model is trained, the gradient variance typically falls.

Those optimizers all work to reduce the variance of the updates in various ways.
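As one illustration of "reducing the variance of the updates", the sketch below applies the exponential moving average used by momentum-style optimizers to pure-noise gradients and compares the variance before and after. The setup (zero-mean, unit-variance, independent noise) and the helper name are assumptions for the example. For that setup the steady-state variance ratio is (1 - beta) / (1 + beta), about 0.053 for beta = 0.9.

```python
import random
import statistics


def ema_variance_ratio(beta=0.9, n_steps=20000, burn_in=200, seed=0):
    """Ratio of EMA-smoothed gradient variance to raw gradient variance.

    Assumption: gradients are i.i.d. N(0, 1) noise, so any reduction in
    variance comes purely from the averaging.
    """
    rng = random.Random(seed)
    m = 0.0
    raw, smoothed = [], []
    for step in range(n_steps):
        g = rng.gauss(0.0, 1.0)
        # Momentum-style exponential moving average of the gradient.
        m = beta * m + (1.0 - beta) * g
        if step >= burn_in:  # skip the warm-up from m = 0
            raw.append(g)
            smoothed.append(m)
    return statistics.pvariance(smoothed) / statistics.pvariance(raw)


ratio = ema_variance_ratio()
print(f"variance ratio (smoothed / raw) ~ {ratio:.3f}")
```

Adam adds a per-coordinate rescaling by a running second-moment estimate on top of this average. This sketch covers only the averaging part.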