by simonster
1155 days ago
For some reason, this article refers to the Chinchilla scaling laws as "data-optimal scaling laws." They are actually compute-optimal scaling laws: they describe how to train the best model at a given computational cost, assuming that both the model size and the amount of training data are constrained only by the amount of compute available. You can get an equally good model with less data if you make the model bigger, but such a model would require more compute to train than the compute-optimal model. It may also be possible to repeat the training set during training and get most of the benefits of training on more data, as long as it isn't repeated too many times. This is common practice in other subfields of ML, but for LLMs the effect of doing so is not well characterized.
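To make "compute-optimal" concrete, here is a rough sketch rather than anything taken from the article or the Chinchilla paper's fitted laws. It assumes two common rules of thumb: training compute is about C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is about 20 tokens per parameter. The function names are made up for illustration.

```python
import math

# Assumption: training compute C ~= 6 * N * D FLOPs (N params, D tokens).
# Assumption: compute-optimal training uses ~20 tokens per parameter.
# Both are rules of thumb, not exact laws.

def training_flops(n_params, n_tokens):
    """Approximate training compute under the 6*N*D rule of thumb."""
    return 6.0 * n_params * n_tokens

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a compute budget into (params, tokens) at a fixed ratio.

    With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = training_flops(10e9, 200e9)  # budget of a 10B model on 200B tokens
n_params, n_tokens = compute_optimal(budget)
print(f"params: {n_params:.3g}, tokens: {n_tokens:.3g}")
```

Under these assumptions, any model with a different size trained on the same budget sits off the optimum: a bigger one sees fewer tokens, a smaller one sees more.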
|
- best score: efficiency doesn't matter (GPT-3, GPT-4)
- best score for a fixed amount of training compute: good for PhD students and people building proofs of concept (Chinchilla)
- best score for a fixed amount of inference compute: good for people who serve their models at scale (LLaMA, ChatGPT turbo)
The article didn't mention the LLaMA scaling laws, which use far more than 20 tokens per weight: about 142 tokens per weight for LLaMA 7B (1T tokens for 7B weights).
> The objective of the scaling laws from Chinchilla is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Chinchilla recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
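The figures in the quoted passage can be checked with a few lines of arithmetic. This is only a sketch: the tokens-per-weight ratios come straight from the quoted numbers, while the training-compute comparison additionally assumes the common 6·N·D FLOPs rule of thumb, which the quote does not state. The helper name is made up.

```python
def tokens_per_param(n_tokens, n_params):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

# Chinchilla recommendation from the quote: 10B model, 200B tokens.
chinchilla_ratio = tokens_per_param(200e9, 10e9)

# LLaMA 7B from the quote: 7B model, 1T tokens.
llama_ratio = tokens_per_param(1e12, 7e9)

# Assumption: training compute ~= 6 * params * tokens (rule of thumb).
compute_multiple = (6 * 7e9 * 1e12) / (6 * 10e9 * 200e9)

print(f"Chinchilla: {chinchilla_ratio:.0f} tokens per weight")
print(f"LLaMA 7B:   {llama_ratio:.1f} tokens per weight")
print(f"LLaMA 7B training compute: {compute_multiple:.1f}x the 10B/200B run")
```

So the smaller model costs several times more to train under this approximation, but every token generated afterwards runs through 7B weights instead of 10B.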
What we care about is the best model we can run on our own hardware, not how efficient its training was; the training doesn't cost us users anything.