by sebzim4500 1291 days ago
I'm far from an expert in this field, but based on my conversations with people who are, I think this is getting less true. Normally these models are trained with straightforward optimizers (basically naive SGD), since advances like batch normalization and residual connections make the fancier stuff unnecessary. I think the learning rate schedules used for these big networks tend to be simple as well, just two or three steps.
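For anyone who hasn't seen one, a "two or three step" schedule is roughly a piecewise-constant learning rate fed into a plain SGD update. This is only a minimal sketch: the names `step_lr` and `sgd_update`, the milestones, the base rate, and the decay factor are all made up for illustration and are not taken from any particular model's training setup.

```python
def step_lr(step, base_lr=0.1, milestones=(30_000, 60_000), gamma=0.1):
    """Piecewise-constant learning rate (hypothetical values).

    The rate starts at base_lr and is multiplied by gamma each time
    the step count reaches one of the milestones.
    """
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops


def sgd_update(weights, grads, lr):
    """Plain SGD with no momentum or adaptivity: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Toy weights and gradients, just to show the two pieces together.
    weights = [1.0, -2.0]
    grads = [0.5, -0.5]
    for step in (0, 30_000, 60_000):
        lr = step_lr(step)
        print(step, lr, sgd_update(weights, grads, lr))
```

With two milestones this gives three learning-rate plateaus, which is about as simple as a schedule gets.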

I work in this field (PhD candidate), and what you say is true for smaller models, but not GPT-3 scale models. Training large-scale models involves a lot more, as the OP said. It's not just learning rate schedulers, it's a whole bunch of stuff.

See this logbook from training the GPT-3 sized OPT model - https://github.com/facebookresearch/metaseq/blob/main/projec...

Seems like the majority of the problems in this log are devops problems, which looks like a combination of ML people doing devops work without devops experience and a really bad cloud vendor. I've been running multiple bare-metal nodes with 8 GPUs each, 24/7, for months at almost 100% utilization, and have had 100x fewer problems than they had.
it is neither as simple as the person you are responding to suggests, nor as complicated as you make it seem. it will only get simpler with time.
so creating each new rev of GPT-3 would involve going through something like all those messy steps in that logbook?