by vladf
1919 days ago
Alternatively, one could eliminate optimizer-state memory entirely by switching to vanilla SGD. I haven’t tried this on transformers, and maybe that’s where it breaks down, but in “classic” supervised settings I’ve found SGD with a tuned learning-rate schedule to be just as fast as Adam.
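To make the memory point concrete, here is a rough pure-Python sketch (not anyone's production code) comparing the per-parameter state the two optimizers carry. Vanilla SGD keeps nothing between steps, while Adam keeps two running averages per parameter. The helper names (`sgd_step`, `adam_step`, `init_adam_state`, `state_floats`) and the hyperparameter defaults are my own choices for illustration.

```python
import math


def sgd_step(params, grads, lr):
    """Vanilla SGD: the update uses only the current gradient, so no state is stored."""
    return [p - lr * g for p, g in zip(params, grads)]


def init_adam_state(n):
    """Adam needs a first-moment (m) and second-moment (v) buffer per parameter."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(params, grads, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction; mutates `state` in place."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - b2 ** t)  # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


def state_floats(state):
    """Count the extra floats an optimizer keeps alongside the parameters."""
    if state is None:  # vanilla SGD has no state object at all
        return 0
    return len(state["m"]) + len(state["v"])


params = [1.0, -2.0, 0.5, 3.0]
grads = [0.5, -0.1, 0.2, 0.0]

adam_state = init_adam_state(len(params))
after_sgd = sgd_step(params, grads, lr=0.1)
after_adam = adam_step(params, grads, adam_state)

print("SGD extra floats: ", state_floats(None))
print("Adam extra floats:", state_floats(adam_state))
```

So for N parameters Adam holds roughly 2N extra values and vanilla SGD holds none. Note this only holds for SGD without momentum; adding momentum brings back one buffer per parameter.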