by tartakovsky 1110 days ago
What's better: training a model with 10X parameters once on some default hyperparameter setting, or searching for a good hyperparameter configuration by training a model with X parameters 10 times? While I'm at it, how many GPT-3-sized LLMs were trained before they landed on the capability of GPT-3? How much of this depends on the data, or do good settings transcend the type of text the model is trained on?
3 comments

The proposed idea here is different.

You can train several smaller models with different hyperparameters under dynamic budgets, i.e. bad configurations are trained for only a few epochs and good ones for more. Once you find a good hyperparameter configuration for the small-scale model, you train the large model with that configuration.
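To make the "dynamic budgets" part concrete, here is a minimal sketch using successive halving, which is one common way to do this and not necessarily the exact method in the linked work. Everything in it is an assumption for illustration: `toy_loss` is a made-up stand-in for actually training a small model, and the learning-rate range, `eta`, and the "good" value of 1e-3 are arbitrary.

```python
import math
import random


def toy_loss(lr, epochs):
    """Hypothetical stand-in for training a small model for `epochs` epochs.

    Loss shrinks with more epochs and is lowest when lr is near 1e-3.
    A real run would train the model and return a validation loss.
    """
    distance = abs(math.log10(lr) - math.log10(1e-3))
    return distance + 1.0 / (1 + epochs)


def successive_halving(configs, min_epochs=1, eta=2):
    """Score every config on a small budget, keep the best 1/eta of them,
    multiply the per-config budget by eta, and repeat until one is left."""
    survivors = list(configs)
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda lr: toy_loss(lr, epochs))
        # Count each rung's budget as if trained from scratch (no resuming).
        total_epochs += epochs * len(survivors)
        survivors = scored[: max(1, len(survivors) // eta)]
        epochs *= eta
    return survivors[0], total_epochs


rng = random.Random(0)
# 16 candidate learning rates, sampled log-uniformly between 1e-5 and 1e-1.
candidates = [10 ** rng.uniform(-5, -1) for _ in range(16)]
best_lr, spent = successive_halving(candidates)
print(f"best lr: {best_lr:.2e}, small-model epochs spent: {spent}")
```

With 16 candidates and `eta=2`, every rung costs the same 16 epochs (16x1, 8x2, 4x4, 2x8), so the whole search costs 64 small-model epochs, against 128 if all 16 configurations were trained for the full 8 epochs. The winning value would then be reused for the single large run. One caveat of the toy: its ranking never changes with budget, whereas real training curves can cross, which is why the survivors get retested at larger budgets.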

What is being shown is that the overhead of doing hyperparameter optimization at a small scale is comparable to a single training run at the largest scale.

Overall, the idea looks very cool.

30B and 40B parameter models regularly crush GPT-3 175B (Davinci) on every benchmark. GPT-3.5 is probably a 13B parameter model and it beats the original GPT-3 175B on most benchmarks (but not the more recent finetunes like 175B Davinci-003), so hyperparameters are clearly very important.
How are hyperparameters tuned for GPT-3.5? Is there any leak about the method they use?
The questions you raise are very interesting. My question would be, where does the default hyperparameter configuration come from? Additionally, does there exist one hyperparameter configuration that performs well on all tasks?