Hacker News
by
gwern
1924 days ago
SGD doesn't work on large Transformers, no. You need something like AdamW.
The_rationalist
1924 days ago
Mish is generally superior to RadamW
https://lessw.medium.com/meet-mish-new-state-of-the-art-ai-a...
dron57
1923 days ago
That's cool, but Mish is an activation function while SGD and AdamW are optimizers. Apples and oranges.
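To make the distinction concrete, here is a rough sketch (not taken from the linked article): Mish is usually written as x * tanh(softplus(x)) and transforms values in the forward pass, while an optimizer updates parameters from gradients. The names `softplus`, `mish`, and `sgd_step` are illustrative helpers made up for this example, and the update shown is plain SGD rather than AdamW.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Activation function: maps one value to another in the forward pass.
    return x * math.tanh(softplus(x))


def sgd_step(params, grads, lr=0.1):
    # Optimizer (plain SGD, illustrative only): moves each parameter
    # against its gradient. It never touches activations directly.
    return [p - lr * g for p, g in zip(params, grads)]
```

The two are not alternatives: a network can use Mish as its activation and still be trained with SGD, AdamW, or any other optimizer.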