Note the model is trained on data generated by GPT-4. It's probably orders of magnitude more expensive to generate the data at current API prices.
The whole point of these papers is that training data quality is key.
I would much prefer for these companies to release the training data than the weights. But that will never happen.
"We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI."
Yes, I think we are seeing the beginning of a feedback loop where we can use current LLMs to generate better datasets at a scale large enough to create new LLMs. This is the positive feedback loop that I think is going to make the biggest difference in model quality over the next few years.
Would it really be a "feedback loop"? I can see how the technique will enable small LLMs to emulate the quality of large LLMs, but I fail to see how training on the output of a large LLM would ever produce something of superior quality to that LLM itself.
Think of astronomy. The first generation of astronomers learns only by observing the night sky. The second generation learns by observing the night sky and also reading the books written by the first generation.
Wouldn't you expect the n^th generation to understand more about astronomy than the first? And maybe from a smaller amount of input - they might make relatively few observations of their own, mainly relying on the books written by the previous generation.
But isn't the comparison you're making that the second (and following) sets of astronomers only study the books of the first ones, and not the night sky itself?
There is probably some limit where making the dataset larger, with more diverse information, does not create meaningful improvements with current architectures. I do not know what that limit is or what it looks like, but I also don’t think we are particularly close to it yet.
“The Pile” dataset is the asset we needed to jumpstart this process. It had so much raw data it could get us over the hump, but Phi and some of the models trained on explicit reasoning make the limitations of random shit people say on the internet pretty clear.
I'm bullish on domain-specific models that start from generalized models. Something of a T-shape analogy, but maybe with a couple of distillation and fine-tuning steps.
I disagree with this. If you give GPT information that was not part of its dataset and ask it to make question and answer pairs off of that information, you are adding higher quality breadth to the training corpus.
> Note the model is trained on data generated by GPT-4.
Is it? I couldn't find that in the page, and can't easily access the links. The previous paper used 1B tokens from GPT-3.5.
> It's probably orders of magnitude more expensive to generate the data at current API prices.
If you're generating a billion tokens, you might do better with dedicated instances. IIRC they used to say that if you were doing more than a few hundred million tokens a month, dedicated capacity was cheaper.
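For a rough sense of scale, here's a back-of-envelope sketch of the pay-per-token side of that comparison. The per-1K-token prices are placeholders I made up for illustration, not quoted API rates, and `generation_cost_usd` is just a hypothetical helper; plug in whatever the current price sheet says.

```python
def generation_cost_usd(total_tokens: int, price_per_1k_tokens_usd: float) -> float:
    """Cost of generating `total_tokens` at a flat per-1K-token price."""
    return total_tokens / 1000 * price_per_1k_tokens_usd


# 1B tokens, the figure mentioned for the previous paper.
TOKENS = 1_000_000_000

# ASSUMPTION: placeholder prices in USD per 1K generated tokens.
CHEAPER_MODEL_PRICE = 0.002
PRICIER_MODEL_PRICE = 0.06

cheaper = generation_cost_usd(TOKENS, CHEAPER_MODEL_PRICE)
pricier = generation_cost_usd(TOKENS, PRICIER_MODEL_PRICE)

print(f"cheaper model: ${cheaper:,.0f}")
print(f"pricier model: ${pricier:,.0f}")
print(f"ratio: {pricier / cheaper:.0f}x")
```

The cost is linear in token count, so the ratio between two models is just the ratio of their prices. Whether dedicated capacity beats this depends on pricing that isn't given anywhere in this thread.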
I might be missing it, but I can't find where it says how the data was generated. It mostly refers back to the previous paper, which stated they used 3.5.
I'd not be too surprised but I can't find anything in the technical report paper saying they're using 4 specifically.