| HN Mirror



	by janalsncm 551 days ago
	Everyone’s experience is different, but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post-2020, though. Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.

1 comments

For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and Adam could be equivalent to just momentum.
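To make the batch-size point concrete, here is a rough sketch (not from the thread) that measures how the variance of a minibatch gradient estimate shrinks as the batch grows. The toy 1-D regression problem, the helper names `per_example_grads` and `minibatch_grad_variance`, and every constant are assumptions chosen for illustration. It says nothing about whether Adam actually matches momentum in a given setup.

```python
import random
import statistics


def per_example_grads(w, data):
    """Gradient of the squared error (w*x - y)**2 with respect to w, one value per example."""
    return [2.0 * (w * x - y) * x for x, y in data]


def minibatch_grad_variance(grads, batch_size, trials, rng):
    """Empirical variance of the minibatch-mean gradient over many random batches."""
    estimates = []
    for _ in range(trials):
        batch = rng.choices(grads, k=batch_size)  # sample with replacement
        estimates.append(sum(batch) / batch_size)
    return statistics.pvariance(estimates)


rng = random.Random(0)

# Hypothetical synthetic data: y = 3x + noise.
data = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.5)))

grads = per_example_grads(w=0.0, data=data)
variances = {
    b: minibatch_grad_variance(grads, b, trials=2000, rng=rng)
    for b in (1, 16, 256)
}
for b, v in variances.items():
    print(f"batch_size={b:4d}  gradient variance={v:.4f}")
```

With batches sampled this way, the variance of the batch mean should fall roughly in proportion to 1/batch_size, so at large batch sizes there is less noise left for an optimizer to smooth out.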

As a model is trained, the gradient variance typically falls.

Those optimizers all work to reduce the variance of the updates in various ways.
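As one illustration of variance reduction (a sketch under my own assumptions, not something the commenter posted), the exponential moving average used in momentum-style gradient buffers can be applied to a stream of noisy gradients and the variance compared before and after. The `ema` helper, the noise level, and `beta = 0.9` are all hypothetical choices.

```python
import random
import statistics


def ema(values, beta):
    """Exponential moving average m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    m = 0.0
    out = []
    for g in values:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out


rng = random.Random(0)
true_grad, noise_std, beta = 1.0, 2.0, 0.9  # illustrative values only

# Noisy gradient stream: a fixed "true" gradient plus independent Gaussian noise.
noisy_grads = [true_grad + rng.gauss(0.0, noise_std) for _ in range(20000)]
smoothed = ema(noisy_grads, beta)[200:]  # drop the warm-up steps

raw_var = statistics.pvariance(noisy_grads)
ema_var = statistics.pvariance(smoothed)
print(f"raw gradient variance:      {raw_var:.3f}")
print(f"EMA (beta={beta}) variance: {ema_var:.3f}")
```

For independent noise, the steady-state variance of this average is (1 - beta) / (1 + beta) times the raw variance, so a larger beta smooths more at the cost of reacting more slowly to changes in the true gradient.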