HN Mirror

by pellucide 795 days ago

From the article

>We made several new observations on scaling behavior during the development of Llama 3. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.

Can someone experienced please explain this? Does this mean that a lean model with more training time and/or more (or better) training data will perform better than a fat model?

2 comments

modeless 795 days ago

Yes. Llama 3 8B outperforms Llama 2 70B (in the instruct-tuned variants).

"Chinchilla-optimal" is about choosing model size and/or dataset size to maximize the accuracy of your model under a fixed training budget (fixed number of floating point operations). For a given dataset size it will tell you the model size to use, and vice versa, again under the assumption of a fixed training budget.
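
To put rough numbers on that, here is a minimal sketch. It leans on two rules of thumb that are assumptions on my part, not something the article states: training compute is about 6 * N * D FLOPs (N parameters, D tokens), and the compute-optimal ratio is around 20 tokens per parameter (the article's ~200B tokens for 8B parameters works out to about 25). The function names are made up for illustration.

```python
import math

# Assumptions (rules of thumb, not taken from the article):
#   training compute C ~= 6 * N * D FLOPs (N = parameters, D = training tokens)
#   compute-optimal data is roughly 20 tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def training_flops(n_params, n_tokens):
    """Approximate training cost in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a fixed FLOP budget into (parameters, tokens) at the assumed ratio."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params


if __name__ == "__main__":
    budget = training_flops(8e9, 15e12)  # 8B parameters trained on 15T tokens
    n_opt, d_opt = compute_optimal_split(budget)
    print(f"budget: {budget:.2e} FLOPs")
    print(f"compute-optimal split: {n_opt:.2e} params, {d_opt:.2e} tokens")
    print(f"15T tokens vs the article's ~200B: {15e12 / 200e9:.0f}x")
```

Under these assumptions, the budget spent on an 8B model with 15T tokens would, if split compute-optimally, go to a model of roughly 77B parameters trained on about 1.5T tokens.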

However, what people have realized is that inference compute matters at least as much as training compute. You want to optimize training and inference cost together, not in isolation. Training a smaller model means your accuracy will not be as good as it could have been with a larger model using the same training budget; however, you'll more than make it up in your inference budget. So in most real-world cases it doesn't make sense to be "Chinchilla-optimal".
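
A back-of-the-envelope way to see this is to add training and lifetime inference cost together. The sketch below assumes the usual approximations of about 6 * N * D FLOPs for training and about 2 * N FLOPs per token at inference; neither comes from the article. The two model configurations are hypothetical and are assumed, purely for illustration, to be equally accurate.

```python
# Assumed rules of thumb (not from the article):
#   training  ~= 6 * N * D FLOPs
#   inference ~= 2 * N FLOPs per token served
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training cost plus the cost of serving `inference_tokens` tokens."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return training + inference


if __name__ == "__main__":
    # Hypothetical configurations, assumed to reach the same accuracy.
    large = {"n_params": 70e9, "train_tokens": 1.4e12}
    small = {"n_params": 8e9, "train_tokens": 15e12}

    for served in (1e11, 1e12, 1e13):
        cost_large = lifetime_flops(inference_tokens=served, **large)
        cost_small = lifetime_flops(inference_tokens=served, **small)
        cheaper = "small" if cost_small < cost_large else "large"
        print(f"{served:.0e} tokens served: large={cost_large:.2e}, "
              f"small={cost_small:.2e} -> {cheaper} is cheaper")
```

With these made-up numbers, the large model is cheaper overall at low serving volume and the small, heavily trained model is cheaper at high volume.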

What Meta is saying here is that there is no accuracy ceiling. You can keep increasing training budget and dataset size to increase accuracy seemingly indefinitely (with diminishing returns). At least as far as they have explored.

HarHarVeryFunny 795 days ago

What's interesting about the minimization of combined training + (model lifetime) inference cost is that that is going to look different for different companies, depending on what their inference volume is...

Meta have a massive user base, and if they are using these models to run their own business, then that implies massive inference volume, and that it might make economic sense for them to put more money into training (to make smaller/cheaper models more powerful) than for other companies with lower inference volume.
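
One way to make the volume dependence concrete is to solve for the break-even number of inference tokens. This is only a sketch: it assumes about 6 * N * D FLOPs for training and about 2 * N FLOPs per inference token, assumes the two models are equally good, and uses a hypothetical function name.

```python
def break_even_inference_tokens(small_params, small_train_tokens,
                                large_params, large_train_tokens):
    """Inference tokens beyond which the smaller model is cheaper overall.

    Assumes training ~= 6 * N * D FLOPs and inference ~= 2 * N FLOPs per token.
    """
    if large_params <= small_params:
        raise ValueError("large_params must exceed small_params")
    # Extra FLOPs spent training the small model for longer (may be <= 0).
    extra_training = 6.0 * (small_params * small_train_tokens
                            - large_params * large_train_tokens)
    # FLOPs saved on every token the small model serves instead of the large one.
    saving_per_token = 2.0 * (large_params - small_params)
    return max(extra_training, 0.0) / saving_per_token


if __name__ == "__main__":
    # Hypothetical pair: 8B on 15T tokens vs 70B on 1.4T tokens.
    tokens = break_even_inference_tokens(8e9, 15e12, 70e9, 1.4e12)
    print(f"break-even at about {tokens:.2e} inference tokens")
```

A company expecting to serve far more tokens than the break-even figure has a reason to pay for the extra training; one serving far fewer does not.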

To put it another way, it'd not be surprising - if their internal use of these models is very high - to see Meta continuing to release models that size for size beat the competition since they were incentivized to pump more tokens through them during training.

greatpostman 795 days ago

Huge resources are being spent on these models at Meta. Some very interesting software will come out of there in the next decade.

pellucide 795 days ago

Somewhere I read that the 8B llama2 model could be undertrained by 100-1000x. So is it possible to train a model with 8B/100 = 80M parameters to perform as well as the llama2 8B model, given enough training time and training tokens?

modeless 795 days ago

It's unclear. It might take a larger dataset than actually exists, or more compute than is practical. Or there may be a limit that we just haven't reached yet; this actually seems quite likely. The scaling "laws" are really more like guidelines and they are likely wrong when extrapolated too far.

pellucide 795 days ago

Thanks!

hnav 795 days ago

They're saying that with this architecture there's a tradeoff between training and inference cost: a 10x smaller model (much cheaper to run inference on) can match a bigger model if the smaller one is trained on 100x the data (much more expensive to train), and the improvement continues log-linearly.
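
As a quick sanity check on that tradeoff, assuming training cost scales with parameters times tokens and per-token inference cost scales with parameters (common approximations, not something the article states), the ratios work out like this. The function name is hypothetical.

```python
def relative_costs(size_factor, data_factor):
    """Costs of a resized model relative to the original.

    size_factor: parameter count relative to the original (0.1 = 10x smaller)
    data_factor: training tokens relative to the original (100 = 100x more)
    Assumes training cost ~ N * D and per-token inference cost ~ N.
    """
    training_ratio = size_factor * data_factor
    inference_ratio = size_factor
    return training_ratio, inference_ratio


if __name__ == "__main__":
    train, infer = relative_costs(size_factor=0.1, data_factor=100)
    print(f"training cost: {train:.1f}x, inference cost per token: {infer:.1f}x")
```

So under these assumptions the 10x smaller model costs about 10x more to train and about 10x less per token to serve.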
