by tartakovsky
1110 days ago
Which is better: training a model with 10X parameters once on some default hyperparameter setting, or searching for a good hyperparameter configuration by training a model with X parameters 10 times? While I’m at it, how many LLMs the size of GPT-3 were trained before they landed on the capability of GPT-3? How much of this depends on the data, or do good settings transcend the type of text the model is trained on?
You can train several smaller models with different hyperparameters under dynamic budgets, i.e. bad configurations are trained for only a few epochs and good ones for more. Once you find a good hyperparameter configuration for the small-scale model, you train the large model with that configuration.
What is being shown is that the total overhead of doing hyperparameter optimization at small scale is comparable to a single training run at the largest scale.
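To make the dynamic-budget idea concrete, here is a minimal sketch in the style of successive halving. That is my choice of scheme for illustration, not necessarily what the paper uses. The `toy_loss` function is a hypothetical stand-in for "train a small model for this many epochs and return its validation loss", the only hyperparameter searched is a learning rate, and each round is counted as a fresh run for simplicity.

```python
import math
import random


def toy_loss(lr, epochs):
    """Hypothetical stand-in for training a small model and returning its loss.

    Assumption for illustration only: loss is lowest near lr = 1e-2 and
    improves as the epoch budget grows.
    """
    return (math.log10(lr) + 2.0) ** 2 + 1.0 / epochs


def successive_halving(configs, min_epochs=1, keep_fraction=0.5):
    """Give every config a small budget, keep the best fraction, double the budget.

    Returns the surviving config and the total number of epochs spent.
    """
    survivors = list(configs)
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        # Evaluate every surviving config at the current budget.
        ranked = sorted(survivors, key=lambda lr: toy_loss(lr, epochs))
        total_epochs += epochs * len(survivors)
        # Drop the worse configs; the rest earn a larger budget next round.
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
        epochs *= 2
    return survivors[0], total_epochs


rng = random.Random(0)
# 16 candidate learning rates, log-uniform between 1e-5 and 1.
candidates = [10 ** rng.uniform(-5, 0) for _ in range(16)]
best_lr, spent = successive_halving(candidates)
print(f"best lr: {best_lr:.4g}, epochs spent: {spent}")
```

With 16 candidates the rounds cost 16×1, 8×2, 4×4 and 2×8 epochs, so 64 in total, versus 128 if all 16 had been trained for 8 epochs each. The selected configuration would then be reused for the single large-scale run.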
Overall, the idea looks very cool.