by andreyk 1292 days ago
I work in this field (PhD candidate), and what you say is true for smaller models, but not for GPT-3-scale models. Training large-scale models involves a lot more, as the OP said. It's not just learning rate schedulers, it's a whole bunch of stuff.

See this logbook from training the GPT-3 sized OPT model - https://github.com/facebookresearch/metaseq/blob/main/projec...

3 comments

Seems like the majority of the problems in this log are devops problems, which seem to come from a combination of ML people doing devops work without devops experience and a really bad cloud vendor. I've been running multiple bare-metal nodes with 8 GPUs each, 24/7 for months at almost 100% utilization, and have had 100x fewer problems than they had.
it is neither as simple as the person you are responding to says, nor as complicated as you make it seem. it will only get simpler with time.
so creating each new rev of GPT3 would involve going through something like all those messy steps in that logbook?