by noman-land 542 days ago
I completely don't understand the use of synthetic data. What good is it to train a model basically on itself?
4 comments

The value of synthetic data relies on having non-zero signal about which generated data is "better" or "worse". In a sense, this is what reinforcement learning is about: generate some data, have that data scored by some evaluator, and then feed the data back into the model with higher weight on the better stuff and lower weight on the worse stuff.

The basic loop is: (i) generate synthetic data, (ii) rate synthetic data, (iii) update model to put more probability on better data and less probability on worse data, then go back to (i).
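
To make the loop above concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the "model" is just a softmax over five items, the evaluator is a hypothetical function that likes item 3, and the update is a crude score-weighted nudge rather than any particular RL algorithm.

```python
import math
import random

def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update(logits, samples, scores, lr=0.5):
    """(iii) Raise logits of above-average samples, lower below-average ones."""
    baseline = sum(scores) / len(scores)
    new_logits = list(logits)
    for s, r in zip(samples, scores):
        new_logits[s] += lr * (r - baseline)
    return new_logits

def train(score_fn, n_items=5, rounds=200, batch=16, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_items
    for _ in range(rounds):
        probs = softmax(logits)
        # (i) generate synthetic data from the current model
        samples = rng.choices(range(n_items), weights=probs, k=batch)
        # (ii) rate it with the evaluator
        scores = [score_fn(s) for s in samples]
        # (iii) shift probability toward the better-rated samples
        logits = update(logits, samples, scores)
    return softmax(logits)

# Hypothetical evaluator: only item 3 counts as "good".
final = train(lambda item: 1.0 if item == 3 else 0.0)
```

The model never sees any data besides its own samples, yet `final` ends up concentrated on item 3, because the evaluator supplies the signal.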

But who rates the synthetic data? If it is humans, I can understand that this is another way to get human knowledge into it, but if it's rated by AI, isn't it just a convoluted way of copying the rating AI's knowledge?
Many things are more easily scored than produced. It's trivial to tell whether a poem rhymes, but writing one is a comparatively slow and difficult task. Since scoring is easier and more discerning than generating, the idea is that you can generate stuff, classify it as good or bad, and then retrain on the good stuff. It's kind of an article of faith for a lot of AI companies and professionals as well, since it keeps you from having to face a data wall, and it's appealingly analogous to a human student practicing and learning.

As far as I know it doesn't work very well so far. It is prone to overfitting, where it ranks highly some trivial detail of the output, e.g. "if a summary starts with a byline of the author, it's a sign of quality", and then starts looping on itself over and over, increasing the frequency and size of bylines until it has run off to infinity and is just repeating a short phrase endlessly. Humans have good baselines and common sense that these ML systems lack. If you've ever seen one of those "deep dream" images, it's the same kind of idea. The "most possible dog" image can look almost nothing like a dog, in the same way that the "most possible poem" may look nothing like a poem.
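
A toy sketch of that failure mode, under loud assumptions: the evaluator here is a made-up proxy that only counts bylines, and the "generator" is a simple hill climber that appends one of two fixed sentences. Real systems are far more complicated, but the runaway dynamic is the same shape.

```python
import random

BYLINE = "By Jane Doe."                      # hypothetical byline
CONTENT = "The report covers Q3 revenue."    # hypothetical real content

def proxy_score(text):
    """Flawed evaluator (assumed): rewards bylines and ignores content."""
    return text.count(BYLINE)

def mutate(text, rng):
    """Propose a variant by appending either a byline or a content sentence."""
    return text + " " + rng.choice([BYLINE, CONTENT])

def optimize(steps=50, seed=0):
    rng = random.Random(seed)
    best = CONTENT
    for _ in range(steps):
        candidate = mutate(best, rng)
        # Keep the candidate only if the proxy likes it strictly more.
        if proxy_score(candidate) > proxy_score(best):
            best = candidate
    return best

summary = optimize()
```

Because only byline additions ever raise the score, `summary` keeps its one original content sentence and then grows nothing but repeated bylines.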

This is the bit I've never understood about training AI on its own output; won't you just regress to the mean?
It's not trained on its own output. You can generate an unlimited number of correctly worked-out math traces and train on those.
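
As a rough illustration of what "correctly worked-out math traces" could mean, here is a sketch that generates addition problems with column-by-column steps. The trace format is made up for this example; the point is only that the data is correct by construction and checked before it is emitted.

```python
import random

def make_trace(rng):
    """Build one 3-digit addition problem with a worked, verified trace."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    steps, digits, carry = [], [], 0
    x, y = a, b
    while x or y or carry:
        da, db = x % 10, y % 10
        total = da + db + carry
        steps.append(
            f"{da} + {db} + carry {carry} = {total}: "
            f"write {total % 10}, carry {total // 10}"
        )
        digits.append(total % 10)
        carry = total // 10
        x //= 10
        y //= 10
    answer = int("".join(str(d) for d in reversed(digits)))
    # Verify the trace against ground truth before using it as training data.
    assert answer == a + b
    return {"question": f"{a} + {b}", "steps": steps, "answer": answer}

rng = random.Random(0)
traces = [make_trace(rng) for _ in range(100)]
```

Each entry in `traces` pairs a question with its intermediate steps and a verified answer, and you can generate as many as you like.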
Thanks, that makes a lot more sense.
This is a good read for some examples https://arxiv.org/abs/2203.14465

> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers

But there are a few others. In general, good data is good data. We're definitely learning more about how to produce good synthetic versions.
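
The loop in the quoted abstract could be sketched roughly like this. It is not the paper's code: `model`, `fine_tune`, and the toy stand-ins below are hypothetical interfaces invented for illustration, and a real setup would call an actual LLM and training job.

```python
def star_round(model, dataset, fine_tune):
    """One round: generate rationales, retry with a hint, keep the correct ones."""
    kept = []
    for question, gold in dataset:
        rationale, answer = model(question, None)
        if answer != gold:
            # Wrong answer: try again, this time given the correct answer.
            rationale, answer = model(question, gold)
        if answer == gold:
            kept.append((question, rationale, gold))
    # Fine-tune only on rationales that ultimately yielded correct answers.
    return fine_tune(kept), kept

# --- Toy stand-ins (assumptions, only so the sketch runs) ---
def make_toy_model(known):
    def model(question, hint):
        if question in known:
            return f"recalled {question}", known[question]
        if hint is not None:
            return f"worked backwards from {hint}", hint
        return "guess", None
    return model

def toy_fine_tune(kept):
    return make_toy_model({q: a for q, _, a in kept})

dataset = [("2+2", 4), ("3+3", 6)]
model0 = make_toy_model({"2+2": 4})
model1, kept = star_round(model0, dataset, toy_fine_tune)
```

In the toy run, `model0` can only answer "2+2" unaided, but after one round `model1` also answers "3+3" without a hint. "Repeat" in the abstract corresponds to calling `star_round` again with the new model.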

One issue with that is that the model may learn to smuggle data. You as a human think that the plain reading of the words is what is doing the reasoning, but (part of) the processing is done by the exact comma placement and synonym choice etc.

Data smuggling is a known phenomenon in similar tasks.

I don't think data smuggling is relevant in STaR-style scenarios. You're still validating the final output. If it works on test data, what could even be smuggled?
> What good is it to train a model basically on itself?

If the model generates data of variable quality, and if there's a good way to distinguish good data from bad data, then training on self-generated data might "bootstrap" a model to better performance.

This is common in reinforcement learning. Famously, AlphaGo Zero (https://en.wikipedia.org/wiki/AlphaGo_Zero) learned exclusively on self-play, without reference to human-played games.

Of course, games have a built-in critic: the better strategy usually wins. It's much harder to judge the answer to a math problem, or decide which essay is more persuasive, or evaluate restaurant recommendations.

If we get to a point where we have a model that, when fed a real-world stream of data (YouTube, surveillance cameras, forum data, cell phone conversations, etc.), can prune out a good training set for itself, then you're at the point where the LLM is in a feedback loop where it can improve itself. That's AGI for all intents and purposes.