by simonster 1155 days ago
For some reason, this article refers to the Chinchilla scaling laws as "data-optimal scaling laws." They are actually scaling laws that describe how to train the best model at a given computational cost, assuming that both the model size and the amount of data on which the model can be trained are constrained only by the amount of compute available. You can get an equally good model with less data if you make the model bigger, but such a model would require more compute to train than the compute-optimal model. It may also be possible to repeat the training set during training and get most of the benefits of training on more data as long as it isn't repeated too many times; this is a common thing to do in other subfields of ML but for LLMs the effect of doing so is not well-characterized.

There are three ways to train:

- best score - don't care about efficiencies (GPT3, GPT4)

- best score for a fixed quantity of compute at training time - good for PhDs and people who make proofs-of-concept (Chinchilla)

- best score for a fixed quantity of compute at inference time - good for people who inference their models at scale (LLaMA, ChatGPT turbo)

The article didn't mention the LLaMA scaling laws, where we use more than 20 tokens per weight, more precisely 142 tokens per weight for LLaMA 7B.
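As a quick sanity check on those ratios, here is a minimal sketch of the arithmetic. It assumes a nominal 7e9 parameters and the 1T training tokens from the LLaMA passage quoted in this comment; `tokens_per_weight` is a made-up helper name, not something from either paper.

```python
def tokens_per_weight(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


# Assumed nominal sizes: LLaMA 7B on 1T tokens, and the 10B / 200B
# Chinchilla example from the quoted LLaMA passage.
llama_7b = tokens_per_weight(1e12, 7e9)          # a bit under 143
chinchilla_10b = tokens_per_weight(200e9, 10e9)  # the ~20 tokens/weight rule

print(f"LLaMA 7B: {llama_7b:.1f} tokens per weight")
print(f"Chinchilla 10B example: {chinchilla_10b:.1f} tokens per weight")
```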

> The objective of the scaling laws from Chinchilla is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Chinchilla recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

What we care about is the best model we can run on our own hardware, not how efficient its training was, since that doesn't cost us users anything.

> What we care about is the best model we can run on our own hardware, not how efficient its training was, since that doesn't cost us users anything.

Training cost is a huge (but diffuse) cost because it limits which organizations can train a model. Such concentration of power greatly affects users.

Even if you want the best score overall, the Chinchilla laws still apply. Any model is trained on a finite amount of compute, and there is an optimal model size (in the sense of minimal loss) for that amount of compute. So the only real difference between the first two options is the amount of compute.
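To make "optimal model size for this amount of compute" concrete, here is a rough sketch. It rests on two assumptions that are rules of thumb, not exact results: training cost is about 6 FLOPs per parameter per token, and the compute-optimal ratio is about 20 tokens per parameter. `compute_optimal` is a hypothetical helper name.

```python
import math


def compute_optimal(compute_flops: float,
                    tokens_per_param: float = 20.0,
                    flops_per_param_token: float = 6.0) -> tuple[float, float]:
    """Split a training budget C into (params N, tokens D).

    Assumes C ~= flops_per_param_token * N * D and D ~= tokens_per_param * N,
    so N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    Both constants are rule-of-thumb assumptions.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget equivalent to training a 10B model on 200B tokens under the 6*N*D rule.
budget = 6.0 * 10e9 * 200e9
n, d = compute_optimal(budget)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions, quadrupling the compute only doubles the model size, with the other factor of two going into data.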

As for inference, if you just want an upper bound on the model size, take the largest model you can afford to serve and train it for as long as possible. There is no evidence (yet) that this approach hits a ceiling.

> people who inference their models at scale

What does "inferencing models at scale" mean?

Actually using the model, instead of doing inference a few times when writing your paper and then that's it.
Optimizing it to be cheaper to run (e.g. OpenAI saved a lot of money with the turbo variant that ChatGPT uses, relative to the original GPT-3 models).
Running server farms to quickly give answers to users, like Bing or ChatGPT.

The models produced have to be fast and efficient to keep capacity and cost manageable, which comes at some expense to quality/accuracy.
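A back-of-the-envelope way to see where "at scale" starts to matter: count how many tokens you have to serve before a smaller, longer-trained model pays back its extra training compute. This sketch assumes roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per generated token, and it assumes, purely for illustration, that both models reach the same target quality, which nothing in this thread establishes. The function names are made up.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # Rule-of-thumb assumption: ~6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    # Rule-of-thumb assumption: ~2 FLOPs per parameter per generated token.
    return 2.0 * n_params


def break_even_tokens(small: tuple[float, float], large: tuple[float, float]) -> float:
    """Served tokens after which the small model is cheaper in total.

    Each argument is (n_params, n_training_tokens).
    """
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = inference_flops_per_token(large[0]) - inference_flops_per_token(small[0])
    if saving_per_token <= 0:
        raise ValueError("the 'small' model must have fewer parameters")
    return max(extra_training, 0.0) / saving_per_token


# Illustrative sizes only: 7B on 1T tokens vs 10B on 200B tokens.
tokens = break_even_tokens(small=(7e9, 1e12), large=(10e9, 200e9))
print(f"break-even after ~{tokens:.2g} served tokens")
```

With these illustrative numbers the break-even lands around 5e12 served tokens; below that volume, the cheaper-to-train model wins on total compute.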

The repeated dataset regime is better characterized than people think; see Meta's Galactica paper: https://arxiv.org/abs/2211.09085