| HN Mirror



	by vladf 1966 days ago
	Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD. I haven’t tried this on transformers, and maybe that’s what breaks down here, but in “classic” supervised settings I’ve found SGD with schedule tuning to be just as fast as Adam.
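
To put rough numbers on the memory point above, here is a back-of-the-envelope sketch. It assumes fp32 optimizer state and uses the usual buffer counts: vanilla SGD keeps no per-parameter state, SGD with momentum keeps one buffer, and Adam/AdamW keep two (first and second moment estimates). The function name and the 1B-parameter model size are hypothetical, chosen only for illustration.

```python
def optimizer_state_bytes(num_params: int, optimizer: str, bytes_per_value: int = 4) -> int:
    """Estimate the extra memory an optimizer holds beyond weights and gradients.

    Assumes one value of `bytes_per_value` bytes per parameter per state buffer
    (4 bytes = fp32). Real frameworks add small overheads such as step counters.
    """
    # Number of per-parameter state buffers each optimizer maintains.
    buffers_per_param = {
        "sgd": 0,           # vanilla SGD: no state
        "sgd_momentum": 1,  # one momentum buffer
        "adam": 2,          # first and second moment estimates
        "adamw": 2,         # same state as Adam; only the weight decay differs
    }
    return num_params * buffers_per_param[optimizer] * bytes_per_value


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical 1B-parameter model
    for name in ("sgd", "sgd_momentum", "adam", "adamw"):
        gib = optimizer_state_bytes(num_params, name) / 2**30
        print(f"{name:>12}: {gib:.2f} GiB of optimizer state")
```

Under these assumptions, the Adam variants carry two extra fp32 copies of the model in optimizer state, which is the memory that switching to vanilla SGD would free up.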

1 comments

SGD doesn't work on large Transformers, no. You need something like AdamW.

That's cool, but Mish is an activation function while SGD and AdamW are optimizers. Apples and oranges.