by pellucide
795 days ago
From the article:

> We made several new observations on scaling behavior during the development of Llama 3. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.

Can someone experienced please explain this? Does this mean that a lean model with more training time and/or more (or better) training data will perform better than a fat model?
"Chinchilla-optimal" is about choosing model size and dataset size to maximize the accuracy of your model under a fixed training budget (a fixed number of floating-point operations). For a given dataset size it tells you the model size to use, and vice versa, again assuming a fixed training budget.
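To make the trade-off concrete, here is a rough sketch of how a fixed budget gets split. It assumes two common rules of thumb that are not stated in the article: training cost is about 6 FLOPs per parameter per token, and the Chinchilla-optimal ratio is about 20 tokens per parameter. The function names are made up for illustration, and the exact constants vary between papers (Meta's ~200B figure for 8B implies a somewhat higher ratio).

```python
import math

# Assumed rules of thumb (not from the article):
FLOPS_PER_PARAM_TOKEN = 6.0   # training FLOPs ~ 6 * params * tokens
TOKENS_PER_PARAM = 20.0       # Chinchilla-optimal tokens ~ 20 * params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a fixed training budget into (params, tokens).

    Solves C = 6 * N * D together with D = 20 * N for N and D.
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Budget that an 8B model would use on 20 tokens per parameter.
budget = training_flops(8e9, TOKENS_PER_PARAM * 8e9)
n_params, n_tokens = chinchilla_optimal(budget)
print(f"budget={budget:.3g} FLOPs -> {n_params:.3g} params, {n_tokens:.3g} tokens")

# Llama 3 8B on 15T tokens, expressed as tokens per parameter.
print(f"Llama 3 8B ratio: {15e12 / 8e9:.0f} tokens per parameter")
```

Under these assumptions, training 8B parameters on 15T tokens is far past the "optimal" ratio, which is exactly the regime the article is describing.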
However, what people have realized is that inference compute matters at least as much as training compute. You want to optimize training and inference cost together, not in isolation. Training a smaller model means your accuracy will not be as good as it could have been with a larger model using the same training budget, but you'll more than make up for it in your inference budget. So in most real-world cases it doesn't make sense to be "Chinchilla-optimal".
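A back-of-the-envelope sketch of that joint accounting, assuming roughly 6 FLOPs per parameter per training token and roughly 2 FLOPs per parameter per generated token at inference. The two model configurations below are hypothetical, and the premise that they reach the same quality is an assumption made purely for illustration, not a measured result.

```python
# Assumed rules of thumb (not from the article):
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0   # forward + backward pass
INFER_FLOPS_PER_PARAM_TOKEN = 2.0   # forward pass only


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training compute plus inference compute over the model's deployment."""
    train = TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * train_tokens
    serve = INFER_FLOPS_PER_PARAM_TOKEN * n_params * served_tokens
    return train + serve


def break_even_served_tokens(small: tuple[float, float], big: tuple[float, float]) -> float:
    """Served tokens after which the over-trained small model is cheaper overall.

    Each argument is (n_params, train_tokens). Assumes the small model
    costs more to train and less to serve than the big one.
    """
    small_n, small_d = small
    big_n, big_d = big
    extra_training = TRAIN_FLOPS_PER_PARAM_TOKEN * (small_n * small_d - big_n * big_d)
    saving_per_token = INFER_FLOPS_PER_PARAM_TOKEN * (big_n - small_n)
    return extra_training / saving_per_token


# Hypothetical pair, assumed (for illustration only) to be equally accurate.
small_model = (8e9, 15e12)   # 8B params, heavily over-trained
big_model = (70e9, 1e12)     # 70B params, trained on far fewer tokens

tokens = break_even_served_tokens(small_model, big_model)
print(f"small model wins after ~{tokens:.3g} served tokens")
```

Past the break-even point every additional served token favors the smaller model, which is why a widely deployed model is worth training well beyond its Chinchilla-optimal token count.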
What Meta is saying here is that they have not hit an accuracy ceiling: you can keep increasing the training budget and dataset size to increase accuracy, seemingly indefinitely (with diminishing returns), at least as far as they have explored.