by desku 2908 days ago
I'm not sure how to interpret the argument of the article or the results in the appendix here.

The first table shows AdamW having the best results, which follows the argument of the article. However, the following three tables all have plain Adam producing the best results.

The way the article is written, it seems to be championing AdamW, but the results only seem to show that AMSGrad is bad and Adam is best, with AdamW giving a negligible improvement over Adam on a single task.

1 comment

From the article:

So, weight decay is always better than L2 regularization with Adam then? We haven’t found a situation where it’s significantly worse, but for either a transfer-learning problem (e.g. fine-tuning Resnet50 on Stanford cars) or RNNs, it didn’t give better results.
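
For anyone unsure what the two options in that quote actually do differently, here is a minimal sketch of a single scalar update, assuming the usual textbook form of Adam and the decoupled decay used by AdamW. The function name `adam_step` and the `decoupled` flag are made up for illustration and are not taken from the article or from any library.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One Adam update on a scalar weight (hypothetical helper).

    decoupled=False: L2 regularization, wd * w is added to the gradient
                     and then rescaled by Adam's adaptive denominator.
    decoupled=True:  AdamW-style decay, wd * w is subtracted from the
                     weight directly and bypasses the moment estimates.
    """
    if not decoupled:
        grad = grad + wd * w  # L2 penalty enters through the gradient

    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_w -= lr * wd * w  # decay applied outside the adaptive step
    return new_w, m, v

# Zero loss gradient, so only the regularization moves the weight.
l2_w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.1)
adamw_w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.1,
                          decoupled=True)
print(l2_w, adamw_w)  # about 0.9 with L2, about 0.99 with decoupled decay
```

With L2, the penalty is normalized like any other gradient, so the first step has size close to `lr` no matter how small `wd` is. With decoupled decay, the weight shrinks by `lr * wd * w`. That difference is why the two can produce different results with Adam even though they coincide for plain SGD, up to a rescaling of the decay coefficient.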