|
|
|
|
|
by modeless
795 days ago
|
|
Yes. Llama 3 8B outperforms Llama 2 70B (in the instruct-tuned variants). "Chinchilla-optimal" is about choosing model size and/or dataset size to maximize the accuracy of your model under a fixed training budget (fixed number of floating point operations). For a given dataset size it will tell you the model size to use, and vice versa, again under the assumption of a fixed training budget. However, what people have realized is that inference compute matters at least as much as training compute. You want to optimize training and inference cost together, not in isolation. Training a smaller model means your accuracy will not be as good as it could have been with a larger model using the same training budget, however you'll more than make it up in your inference budget. So in most real world cases it doesn't make sense to be "Chinchilla-optimal". What Meta is saying here is that there is no accuracy ceiling. You can keep increasing training budget and dataset size to increase accuracy seemingly indefinitely (with diminishing returns). At least as far as they have explored. |
|
Meta have a massive user base, and if they are using these models to run their own business, then that implies massive inference volume, and that it might make economic sense for them to put more money into training (to make smaller/cheaper models more powerful) than for other companies with lower inference volume.
To put it another way, it'd not be surprising - if their internal use of these models is very high - to see Meta continuing to release models that size for size beat the competition since they were incentivized to pump more tokens through them during training.