by janalsncm
504 days ago
Everyone’s experience is different, but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post-2020, though. Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.
As a model is trained, the gradient variance typically falls.
Those optimizers all work to reduce the variance of the updates in various ways.
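
To make the variance point concrete, here is a rough sketch in plain Python, not anyone's production code. `adamw_step` is a scalar version of the usual AdamW update with commonly used default hyperparameters, and `smoothed_vs_raw` is a hypothetical helper that compares the variance of raw noisy gradients against their exponential moving average (the first-moment term). The constant "true" gradient of 1.0 and the Gaussian noise are assumptions chosen purely for illustration.

```python
import random
import statistics


def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w          # decoupled weight decay
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v


def smoothed_vs_raw(n_steps=5000, noise_std=1.0, beta1=0.9, seed=0, burn_in=100):
    """Return (variance of raw gradients, variance of their moving average)."""
    rng = random.Random(seed)
    m = 0.0
    raw, smoothed = [], []
    for _ in range(n_steps):
        g = 1.0 + rng.gauss(0.0, noise_std)  # assumed: true gradient 1.0 plus noise
        m = beta1 * m + (1 - beta1) * g
        raw.append(g)
        smoothed.append(m)
    # Skip the first steps, where the average is still warming up from zero.
    return (statistics.pvariance(raw[burn_in:]),
            statistics.pvariance(smoothed[burn_in:]))


if __name__ == "__main__":
    raw_var, smooth_var = smoothed_vs_raw()
    print(f"raw gradient variance:      {raw_var:.3f}")
    print(f"smoothed gradient variance: {smooth_var:.3f}")
```

With `beta1=0.9` the averaged signal should come out far less noisy than the raw gradients, which is the effect the momentum term relies on. The division by the root of the second moment then rescales the step size per parameter.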