by sebzim4500
1291 days ago
I'm far from an expert in this field, but based on my conversations with people who are, I think this is getting less true. These models are normally trained with straightforward optimizers (basically naive SGD), since advances like batch normalization and residual connections make the fancier stuff unnecessary. I think the learning rate schedules used for these big networks tend to be simple as well, with just two or three steps.
See this logbook from training the GPT-3-sized OPT model: https://github.com/facebookresearch/metaseq/blob/main/projec...
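
To make "naive SGD with a two or three step schedule" concrete, here is a rough sketch in plain Python. The function names (`step_lr`, `sgd_step`), the milestones, and the decay factor are all made up for illustration; none of this is taken from the OPT logbook or any real training config.

```python
def step_lr(step, base_lr=0.1, milestones=(1000, 2000), gamma=0.1):
    """Piecewise-constant learning rate (hypothetical values).

    The rate is multiplied by `gamma` once for every milestone
    that `step` has reached, giving a schedule with a few flat steps.
    """
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops


def sgd_step(params, grads, lr):
    """Plain SGD update: no momentum, no adaptive scaling."""
    return [p - lr * g for p, g in zip(params, grads)]


if __name__ == "__main__":
    # Toy problem: minimize f(w) = w^2, whose gradient is 2w.
    w = [5.0]
    for step in range(3000):
        grads = [2.0 * w[0]]
        w = sgd_step(w, grads, step_lr(step))
    print(w)
```

With two milestones the schedule has three flat segments, which is about as simple as a decaying schedule gets.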