If a model had been trained in 1837, would it still be useful today? And how will models be trained in 2037, when most of the web might be autogenerated on the fly, like in the cgi-bin era?
State-of-the-art models aren't trained the same way the first models were. High-quality datasets are far more valuable than simply feeding in everything you could possibly crawl. Throwing in the kitchen sink and then some is a great way to burn money while also hurting your model's accuracy.
Are there any publications out there analyzing this in more depth? How are these datasets scheduled? Do you put your highest-quality data first, or do you actually train on "dumb" data first until you establish some general language understanding, and only then feed in the high-quality information? There is a lot of interesting research to do here that I'm sure people have already investigated...
I don't follow the hype too closely, but I guess the early models were trained on data that was classified en masse by underpaid third-world workers.
Today you could use yesterday's model to classify the data for you and build from that. Heck, you can even create synthetic data with current tech.
The quality of your model is at best going to match the quality of the data. If you use yesterday's model to label data or create a synthetic dataset, then the new model built on top of it cannot go beyond that. If it could, then it could do the same (and better) with the data that trained yesterday's model.
This is not an accurate assessment; the forward pass is nontrivial, i.e. you're always adding new information. When they say "synthetic" datasets, nobody is suggesting that the past model is used to invent the data completely. What they mean is that the model is used to "clean" or "transform" the data at a fidelity and scale that otherwise wouldn't be possible.
We do this in fine-tuning all the time: see reverse prompting, etc.
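For what it's worth, here is a minimal sketch of what "transform, don't invent" can look like for reverse prompting. The assumption is that the response side always stays the original human-written text, and the older model only proposes an instruction for it and scores the pairing. The names reverse_prompt, propose_instruction, score_pair and the toy stand-ins are all hypothetical placeholders for real model calls, not anyone's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    instruction: str
    response: str


def reverse_prompt(
    documents: Iterable[str],
    propose_instruction: Callable[[str], str],
    score_pair: Callable[[str, str], float],
    min_score: float = 0.5,
) -> List[Example]:
    """Turn raw documents into (instruction, response) pairs.

    The response is the original text; the model only proposes an
    instruction for it and decides whether the pair is worth keeping.
    """
    kept: List[Example] = []
    for doc in documents:
        text = " ".join(doc.split())  # "clean": normalize whitespace
        if not text:
            continue
        instruction = propose_instruction(text)  # "transform"
        if score_pair(instruction, text) >= min_score:  # filter
            kept.append(Example(instruction, text))
    return kept


# Toy stand-ins for yesterday's model (hypothetical, for illustration only).
def toy_propose(text: str) -> str:
    return "Explain: " + text.split(".")[0][:40]


def toy_score(instruction: str, text: str) -> float:
    # Pretend that longer documents are higher quality.
    return min(len(text.split()) / 10.0, 1.0)


docs = [
    "  Gradient   descent minimizes a loss by following the negative gradient. It is iterative.",
    "ok",
    "",
]
examples = reverse_prompt(docs, toy_propose, toy_score)
```

With the toy scorer only the first document survives; the point is just that the new information comes from the source text, while the model supplies the labeling and filtering.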
My bad then, I have not seen it done successfully yet. Do you happen to have some references at hand? I would be more than grateful! Thanks in advance!
You can create inputs for DPO/ORPO synthetically, which is a huge one, as previously it would have required gigantic investments: https://arxiv.org/abs/2402.10379
There's also the gemma2 paper, which advanced the SOTA in distillation. On a side note, IMHO it's currently the best model for e.g. Ukrainian; there are many reasons for that, but vocab_size and the well-chosen 9b/27b sizes stand out. In fact, I prefer it to anything else out there, including the much larger llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
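To make the DPO/ORPO point above a bit more concrete, here is a rough sketch of building preference pairs synthetically: sample several completions per prompt, score them with a judge, and keep the best and worst as chosen/rejected. This is only an illustration under assumptions, not the method of either linked paper. build_preference_pairs, generate, judge and the toy functions are hypothetical names, and the prompt/chosen/rejected keys are just a commonly used layout for preference data.

```python
from typing import Callable, Dict, List, Sequence


def build_preference_pairs(
    prompts: Sequence[str],
    generate: Callable[[str, int], List[str]],
    judge: Callable[[str, str], float],
    n_candidates: int = 4,
    min_margin: float = 0.1,
) -> List[Dict[str, str]]:
    """Build (prompt, chosen, rejected) records from sampled completions."""
    pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        if len(candidates) < 2:
            continue  # need at least two completions to form a pair
        # Score each candidate once, then sort from best to worst.
        scored = sorted(
            ((judge(prompt, c), c) for c in candidates),
            key=lambda item: item[0],
            reverse=True,
        )
        (best_score, best), (worst_score, worst) = scored[0], scored[-1]
        if best_score - worst_score < min_margin:
            continue  # judge cannot tell them apart; skip the noisy pair
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


# Toy stand-ins for the sampling model and the judge (hypothetical).
def toy_generate(prompt: str, n: int) -> List[str]:
    return [prompt + " answer" + "!" * i for i in range(n)]


def toy_judge(prompt: str, completion: str) -> float:
    # Pretend fewer exclamation marks means a better answer.
    return -float(completion.count("!"))


pairs = build_preference_pairs(["Q1"], toy_generate, toy_judge, n_candidates=3)
```

The min_margin check is a design choice: dropping pairs the judge can barely separate trades dataset size for a cleaner preference signal.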