
by visarga 1202 days ago

There are three ways to train:

- best score - don't care about efficiency (GPT-3, GPT-4)

- best score for a fixed quantity of compute at training time - good for PhDs and people who make proofs-of-concept (Chinchilla)

- best score for a fixed quantity of compute at inference time - good for people who inference their models at scale (LLaMA, ChatGPT turbo)

The article didn't mention the LLaMA scaling laws, where we use far more than 20 tokens per weight: about 143 tokens per weight for LLaMA 7B (1T tokens over 7B weights).

> The objective of the scaling laws from Chinchilla is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Chinchilla recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

What we care about is the best model we could run on our own hardware, not how efficient was its training, that doesn't cost us users anything.
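
The tokens-per-weight figures in this comment can be checked with simple division. The sketch below uses only the numbers from the quoted passage (a 10B model on 200B tokens, a 7B model on 1T tokens); the function name is made up for illustration, and the parameter counts are the nominal ones, not exact weight counts.

```python
def tokens_per_weight(train_tokens: float, n_weights: float) -> float:
    """Training tokens seen per model weight."""
    return train_tokens / n_weights


# Figures taken from the quoted passage; parameter counts are nominal.
chinchilla_example = tokens_per_weight(200e9, 10e9)  # 10B model, 200B tokens
llama_7b = tokens_per_weight(1e12, 7e9)              # 7B model, 1T tokens

print(f"10B on 200B tokens: {chinchilla_example:.0f} tokens per weight")
print(f"7B on 1T tokens:    {llama_7b:.1f} tokens per weight")
```

So the 7B run sees roughly seven times as many tokens per weight as the Chinchilla-style recommendation.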

3 comments

nextaccountic 1202 days ago

> What we care about is the best model we could run on our own hardware, not how efficient was its training, that doesn't cost us users anything.

Training cost is a huge (but diffuse) cost because it limits which organizations can train a model. Such concentration of power greatly affects users.


EvgeniyZh 1201 days ago

Even if you want the best score overall, the Chinchilla laws still apply. Any model is trained on a finite amount of compute, and there is an optimal (in the sense of minimal loss) model size for that amount of compute. So the difference between options 1 and 2 is basically just the amount of compute.

As for inference, if you just want to bound the possible model size from above, take the largest model you can accommodate and train it for as long as possible. There is no evidence (yet) that we hit a ceiling this way.
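
A rough sketch of the "optimal model size for a given amount of compute" idea. It rests on two assumptions that are not stated in this thread: the common approximation that training costs about 6 FLOPs per weight per token, and the Chinchilla rule of thumb of about 20 tokens per weight. The function and constant names are hypothetical.

```python
import math

FLOPS_PER_WEIGHT_TOKEN = 6.0  # assumption: training FLOPs ~ 6 * N * D
TOKENS_PER_WEIGHT = 20.0      # assumption: Chinchilla rule of thumb, D ~ 20 * N


def compute_optimal(train_flops: float) -> tuple[float, float]:
    """Return (weights, tokens) that spend train_flops under the two assumptions.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    """
    n = math.sqrt(train_flops / (FLOPS_PER_WEIGHT_TOKEN * TOKENS_PER_WEIGHT))
    d = TOKENS_PER_WEIGHT * n
    return n, d


# Budget that matches the 10B-model / 200B-token example quoted upthread.
budget = 6.0 * 10e9 * 200e9
n, d = compute_optimal(budget)
print(f"budget {budget:.2e} FLOPs -> {n:.2e} weights, {d:.2e} tokens")
```

Under these assumptions a bigger budget only moves you along the same curve, which is the sense in which options 1 and 2 differ only in compute.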


esperent 1202 days ago

> people who inference their models at scale

What does "inferencing models at scale" mean?


cypress66 1202 days ago

Actually using the model, instead of doing inference a few times when writing your paper and then that's it.
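
To put rough numbers on why heavy usage changes the picture, here is a back-of-the-envelope sketch. It assumes about 6 FLOPs per weight per training token and about 2 FLOPs per weight per generated token, and it assumes, purely for illustration, that the 7B model trained on 1T tokens reaches the same quality as the 10B model trained on 200B tokens. None of these assumptions come from the thread, and the function names are hypothetical.

```python
def train_flops(n_weights: float, train_tokens: float) -> float:
    """Approximate training cost (assumption: 6 FLOPs per weight per token)."""
    return 6.0 * n_weights * train_tokens


def inference_flops_per_token(n_weights: float) -> float:
    """Approximate serving cost (assumption: 2 FLOPs per weight per token)."""
    return 2.0 * n_weights


def break_even_tokens(n_small, d_small, n_large, d_large) -> float:
    """Served tokens after which the smaller, longer-trained model is cheaper overall."""
    extra_training = train_flops(n_small, d_small) - train_flops(n_large, d_large)
    saving_per_token = inference_flops_per_token(n_large) - inference_flops_per_token(n_small)
    return extra_training / saving_per_token


# 7B weights on 1T tokens versus 10B weights on 200B tokens.
tokens_needed = break_even_tokens(7e9, 1e12, 10e9, 200e9)
print(f"break-even after about {tokens_needed:.1e} served tokens")
```

A paper that runs inference a few times never gets near that volume; a service with many users might, which is when the extra training pays for itself.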


andai 1202 days ago

Optimizing it to be cheaper to run (e.g. OpenAI saved a lot of money with the turbo variant that ChatGPT uses, relative to the original GPT-3 models).


Closi 1202 days ago

Running server farms to quickly give answers to users, like Bing or ChatGPT.

The models produced have to be fast and efficient to meet capacity and cost targets, which comes at some cost to quality/accuracy.
