by dartos 843 days ago
Well, doesn’t the compute time for transformers scale roughly quadratically with model size?

Would it make sense for power consumption to also scale roughly quadratically?

1 comment

I'm not sure. The figures I've seen suggest that GPT3 required 10x more energy to train than GPT2 (e.g. https://www.nnlabs.org/power-requirements-of-large-language-....), so I think an increase of roughly 1-2 orders of magnitude in energy usage from GPT2 to GPT3.5 makes sense.
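
For anyone who wants to sanity-check the arithmetic, here is a minimal sketch that converts an energy ratio into orders of magnitude and compares it with what a pure power law in model size would predict. The helper names and the 10x size ratio are hypothetical placeholders for illustration, not figures from the linked page; only the 10x energy ratio comes from the comment above.

```python
import math


def orders_of_magnitude(ratio: float) -> float:
    """Express a multiplicative ratio as a number of orders of magnitude."""
    return math.log10(ratio)


def predicted_energy_ratio(size_ratio: float, exponent: float = 2.0) -> float:
    """Energy ratio predicted if energy scaled as (model size) ** exponent.

    exponent=2.0 corresponds to the quadratic scaling asked about above.
    """
    return size_ratio ** exponent


# Energy ratio quoted in the comment (GPT2 -> GPT3).
reported_energy_ratio = 10.0

# Hypothetical size ratio, chosen only to illustrate the calculation.
hypothetical_size_ratio = 10.0

quadratic_ratio = predicted_energy_ratio(hypothetical_size_ratio)

print(f"reported: {orders_of_magnitude(reported_energy_ratio):.1f} orders of magnitude")
print(f"quadratic prediction: {orders_of_magnitude(quadratic_ratio):.1f} orders of magnitude")
```

Swapping in real parameter counts for the size ratio would show how far the reported energy figures are from a strictly quadratic trend.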