|
> Almost every team that I’ve been talking to that is training an LLM right now talks about how they’re training a Chinchilla optimal model, which is remarkable given that basically everything in the LLM space changes every week. I hope that either that's a miscommunication, or I'm wrong about how much of a red flag that seems to be. The Chinchilla scaling laws allow you to relate, at a somewhat-better-than-rule-of-thumb level, the model size, training data size, and achieved performance of an LLM, without actually training one. So, if for instance you have a certain loss target, and a training corpus of a certain size, you can use the scaling law to calculate what size of model to train to hit the target. I can see that being useful to any team. Chinchilla-optimality on the other hand means finding, for a set loss target, the combination of model size and training data size that minimizes training compute (which, roughly speaking, scales with just the product of those two numbers). But only training compute: inference compute per token scales only with model size, regardless of how much data the model was trained on. So Chinchilla-optimality is useful only if you expect training to take up most of your compute, i.e. if you are not expecting to actually use the model that much. I'm not in the field myself so I don't know how to quantify "that much", but it's definitely enough to keep those concepts distinct. |
https://finbarr.ca/llms-not-trained-enough/
I think the conversations were partly (largely?) a snapshot in time. I was talking to people in February/March, and all of this was much less thought through at the time. But you’re totally right. You want something like Llama, where you train a smaller model for longer than Chinchilla would prescribe.
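To put rough numbers on "that much", here is a minimal Python sketch of the trade-off discussed above. It assumes the parametric loss fit published in the Chinchilla paper (Hoffmann et al., 2022), plus the usual approximations of about 6·N·D FLOPs for training and about 2·N FLOPs per generated token at inference. The loss target of 2.0 and the 10 trillion lifetime inference tokens are made-up illustration values, and the function names are mine, not from any library.

```python
# Chinchilla parametric fit: L(N, D) = E + A / N**ALPHA + B / D**BETA
# Constants are the published fit from Hoffmann et al. (2022); treat them as an assumption.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_needed(n_params, target_loss):
    """Training tokens D a model with n_params needs to reach target_loss.

    Returns None if the model is too small to reach the target with any amount of data.
    """
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None
    return (B / gap) ** (1.0 / BETA)


def total_flops(n_params, target_loss, inference_tokens):
    """Approximate lifetime compute: 6*N*D for training plus 2*N per inference token."""
    d = tokens_needed(n_params, target_loss)
    if d is None:
        return float("inf")
    return 6 * n_params * d + 2 * n_params * inference_tokens


def best_size(target_loss, inference_tokens, grid):
    """Model size on the grid that minimizes lifetime compute for the loss target."""
    return min(grid, key=lambda n: total_flops(n, target_loss, inference_tokens))


# Log-spaced candidate model sizes from 1e8 to 1e12 parameters.
GRID = [10 ** (8 + i / 50) for i in range(201)]

TARGET = 2.0  # hypothetical loss target
n_train_only = best_size(TARGET, 0, GRID)       # Chinchilla-optimal: training compute only
n_with_serving = best_size(TARGET, 1e13, GRID)  # hypothetical 10T tokens served over the model's life

for label, n in [("training only", n_train_only), ("with serving", n_with_serving)]:
    print(f"{label}: N = {n:.3g} params, D = {tokens_needed(n, TARGET):.3g} tokens")
```

Under these assumptions, counting inference pushes the best choice toward a smaller model trained on more tokens for the same loss target, which is the Llama-style recipe. How far it moves depends entirely on the inference volume you plug in.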