by janalsncm 504 days ago
Everyone’s experience is different but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post 2020, though.

Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.


For example, if large batch sizes are practical for your setup, the gradient variance will be lower and Adam could end up behaving much like plain momentum.
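If you want to sanity-check the batch size part of that, here is a rough NumPy sketch. It is a toy linear regression I made up for illustration (the data, sizes, and function names like `minibatch_grad` and `grad_variance` are all hypothetical, not from any real training setup). It only measures how noisy the minibatch gradient is at a fixed point for a few batch sizes; it says nothing about Adam itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression data; sizes and noise level are arbitrary assumptions.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)

# Fixed parameter vector at which we measure the gradient noise.
w = np.zeros(d)

def minibatch_grad(batch_size):
    """Gradient of mean squared error on one randomly sampled minibatch."""
    idx = rng.integers(0, n, size=batch_size)  # sample with replacement
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def grad_variance(batch_size, trials=2000):
    """Total variance (summed over coordinates) of the minibatch gradient."""
    grads = np.stack([minibatch_grad(batch_size) for _ in range(trials)])
    return grads.var(axis=0).sum()

variances = {b: grad_variance(b) for b in (8, 64, 512)}
for b, v in variances.items():
    print(f"batch_size={b:4d}  gradient variance={v:.4f}")
```

In this setup the variance should shrink roughly in proportion to 1/batch_size, since each minibatch gradient is an average of independently sampled per-example gradients.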

As a model is trained, the gradient variance typically falls.

Those optimizers all work to reduce the variance of the updates in various ways.