Hacker News
by vladf 1919 days ago
Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers, and maybe that’s where it breaks down, but in “classic” supervised settings I’ve found SGD with a tuned learning-rate schedule to be just as fast as Adam.
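
To put rough numbers on the memory point, here is a back-of-the-envelope sketch. It assumes fp32 optimizer state and counts only the per-parameter buffers each optimizer keeps on top of the weights and gradients; the function name, the table, and the 1B-parameter model are made up for illustration, and real frameworks add small extras (step counters and so on).

```python
# Extra per-parameter state buffers each optimizer keeps,
# not counting the weights or the gradients themselves.
STATE_BUFFERS = {
    "sgd": 0,           # vanilla SGD: no extra state
    "sgd_momentum": 1,  # one velocity buffer
    "adam": 2,          # first- and second-moment estimates
    "adamw": 2,         # same state as Adam; only the weight decay step differs
}


def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Approximate optimizer state size in bytes (fp32 state assumed by default)."""
    return num_params * STATE_BUFFERS[optimizer] * bytes_per_value


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical 1B-parameter model
    for name in STATE_BUFFERS:
        gib = optimizer_state_bytes(n, name) / 2**30
        print(f"{name:>12}: {gib:.2f} GiB of optimizer state")
```

So under these assumptions Adam-style optimizers carry two extra values per parameter, and vanilla SGD carries none. Whether SGD then trains the model as well is a separate question.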

1 comments

SGD doesn't work on large Transformers, no. You need something like AdamW.
That's cool, but Mish is an activation function while SGD and AdamW are optimizers. Apples and oranges.