|
Google's DeepMind put out the Chinchilla paper last year, showing that GPT-3 and others could have gotten better at the same size by just shoving more tokens at them over a longer training run. The paper showed some snazzy curves where more training time and data equalled better quality, and made the case that a lot more training tokens and time could get better quality out of models smaller than GPT-3's 175B. This was, for a minute, ignored, because the PaLM paper came out very shortly thereafter and seemed to show, pretty conclusively, that there are unusual and exciting emergent behaviours coming out of much larger models (PaLM is 540B parameters), and so that was hotter news. In the meantime, some really smart folks looked at the Chinchilla curves and were like "hmm. One way to think about this is to see that if you are willing to put a LOT more compute in upfront on a model, then the inference costs go down as some sub-linear function." Llama's instinct is that if you're going to give away a model, and it is going to get run on the edge, it might make sense to spend a whole, whole lot of compute, once, training something past what the paper considered optimal, and well into the territory the paper thought of as "not worth it", precisely because the entire world might be able to run it if you can get something good and much smaller. The upshot: OPT and LLMs from its era are significantly 'under-trained' compared even to GPT-3, which is itself under-trained by something like an order of magnitude from where the Chinchilla paper implies it should be. I guess I made up the phrases 'over-trained' and 'under-trained'; there might be some other way to talk about it elsewhere. Sorry! :) |
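
To put rough numbers on "under-trained" and "over-trained", here is a back-of-the-envelope sketch. It assumes the commonly quoted reading of the Chinchilla results of roughly 20 training tokens per parameter, which is a rule of thumb rather than an exact law, and it uses approximate, publicly reported parameter and token counts. The function names and the `MODELS` table are made up for illustration.

```python
# Back-of-the-envelope tokens-per-parameter comparison.
# Assumption: ~20 training tokens per parameter is a rule of thumb often
# read off the Chinchilla results; it is not an exact law.
CHINCHILLA_TOKENS_PER_PARAM = 20

# name: (parameters, training tokens). Approximate, publicly reported figures.
MODELS = {
    "GPT-3 175B": (175e9, 300e9),
    "OPT-175B": (175e9, 180e9),
    "Chinchilla 70B": (70e9, 1.4e12),
    "LLaMA 7B": (7e9, 1.0e12),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def shortfall(params, tokens, target=CHINCHILLA_TOKENS_PER_PARAM):
    """Rule-of-thumb token budget divided by the actual token budget.

    Above 1 means "under-trained" in the sense used above;
    below 1 means trained past the rule-of-thumb point ("over-trained").
    """
    return target * params / tokens


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    gap = shortfall(params, tokens)
    print(f"{name:>15}: {ratio:6.1f} tokens/param, rule-of-thumb/actual = {gap:5.2f}")
```

With these figures GPT-3 comes out a bit more than 10x short of the rule-of-thumb budget and OPT-175B further still, while LLaMA 7B sits well past it, which is the "spend the compute once so everyone can run something small" bet described above.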