HN Mirror



	by wg0 614 days ago
	If a model had been trained in 1837, would it still be useful today? And how would models be trained in 2037, when most of the web might be autogenerated on the fly, as in the cgi-bin era?

1 comments

Etheryte 614 days ago

State-of-the-art models aren't trained the same way the first models were. High-quality datasets are both much more valuable and more useful than simply feeding in everything you could possibly crawl. Throwing in the kitchen sink and then some is a great way to burn money while also hurting your model's accuracy.


kettleballroll 614 days ago

Are there any publications out there analyzing this more in depth? How are these datasets scheduled? Do you have your highest quality data first, or do you actually train using "dumb" data first until you establish some general language understanding before giving the high quality information? There is a lot of interesting research to do here that I'm sure people have already investigated....


zeroq 614 days ago

I don't follow the hype too closely, but I guess the early models were trained on data that was classified by underpaid 3rd world workers en masse. Today you could use yesterday's model to classify the data for you and build from that. Heck, you can even create synthetic data with current tech.


youoy 614 days ago

The quality of your model is going to match at best the quality of the data. If you use yesterday's model to label data/create a synthetic dataset, then the new model built on top of it cannot go beyond that. If it can, then it can also do it (and better) with the data that trained yesterday's model.


tucnak 614 days ago

This is not an accurate assessment; the forward pass is nontrivial, i.e. you're always adding new information. When they say "synthetic" datasets, nobody is suggesting that the past model is used to invent the data completely. What they mean is that the model is used to "clean" or "transform" the data at a fidelity and scale that otherwise wouldn't be possible.

We do this in fine-tuning all the time: see reverse prompting, etc.
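
A minimal sketch of what that can look like, assuming "reverse prompting" here means asking a model to write the instruction that an existing, human-written text would answer, so that the response side of each training pair stays real data. The `generate` callable, the prompt template, and `fake_model` are hypothetical stand-ins for illustration, not any particular library's API.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template; a real pipeline would tune the wording.
TEMPLATE = (
    "Below is a response written by a person.\n"
    "Write the instruction it most plausibly answers.\n\n"
    "Response:\n{doc}\n\nInstruction:"
)


def reverse_prompt(
    documents: List[str], generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Build (instruction, response) pairs from existing text.

    The model only supplies the instruction; the response is the
    original document, untouched.
    """
    pairs = []
    for doc in documents:
        instruction = generate(TEMPLATE.format(doc=doc)).strip()
        if instruction:  # skip documents the model returned nothing for
            pairs.append({"instruction": instruction, "response": doc})
    return pairs


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption, for illustration only).
    return " Explain why curated data beats raw crawls. "


docs = ["High quality datasets are more useful than crawling everything."]
pairs = reverse_prompt(docs, fake_model)
```

In practice the generated pairs would still be filtered for quality before fine-tuning on them.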


youoy 614 days ago

My bad then, I have not seen it done successfully yet. Do you happen to have some references at hand? I would be more than grateful! Thanks in advance!


tucnak 614 days ago

The LIMA paper, I think, would be a good place to start https://arxiv.org/abs/2305.11206

You can create inputs for DPO/ORPO synthetically, which is a huge one, as previously it would have required gigantic investments https://arxiv.org/abs/2402.10379

There's also the gemma2 paper, which has advanced SOTA in distillation. On a side note, and there are many reasons for it, among them vocab_size and the well-chosen 9b/27b sizes, IMHO it's currently the best model for, e.g., Ukrainian. In fact, I prefer it to anything else there is, including the much larger llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118

Also see the Princeton paper on SimPO, which is how they supercharged the 9b gemmas recently. https://arxiv.org/abs/2405.14734
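
For anyone unfamiliar with the objectives named above, here is a rough sketch of the per-pair DPO and SimPO losses as they are commonly written. It assumes the summed token log-probabilities of the chosen and rejected responses are already computed; the function names and the default `beta`/`gamma` values are illustrative placeholders, not recommended settings.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """DPO: compare policy vs. reference log-prob ratios.

    pol_* / ref_* are summed log-probs of the chosen (w) and rejected (l)
    responses under the policy and a frozen reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def simpo_loss(pol_w: float, pol_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: no reference model; length-normalized log-probs and a
    target margin gamma instead."""
    margin = beta * (pol_w / len_w - pol_l / len_l) - gamma
    return -log_sigmoid(margin)


# A policy that matches the reference gives the neutral DPO loss, log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# SimPO loss shrinks as the chosen response becomes more likely per token.
good = simpo_loss(-10.0, -30.0, 10, 10)
bad = simpo_loss(-30.0, -10.0, 10, 10)
```

The practical difference is that SimPO needs no reference model in memory, and either loss can be fed with synthetically generated chosen/rejected pairs.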
