by jerf 805 days ago
The concern is not just vaguely cynical hand-wringing about how bad AI is. Feeding AIs their own output as training material is a bad thing for mathematical reasons, and feeding AIs the output of other very similar AIs is close enough for it to also be bad. The reasons are subtle and hard to describe in plain English, and I'm not enough of an expert to even try, so pardon me if I don't. But given that it is hard to determine whether output is from an AI, AI really does face a crisis: good training material will be hard to come by in the future.
2 comments

>Feeding AIs their own output as training material is a bad thing for mathematical reasons

Most model collapse studies explore degenerate cases, where the same model is retrained on its own output over and over, to find the potential limits of the training process. No wonder you get terrible results if you recursively recompress a JPEG 100 times! In the real world it's nowhere near that bad, because models are never trained on their own output alone and always receive a certain amount of external data, starting with manual dataset curation (yes, that's also fresh data in itself).

Meanwhile, synthetic datasets are entirely common. I suspect this is a non-issue that is way overblown by people misinterpreting these studies.
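
A toy sketch of that distinction, not taken from any of the studies: the "model" here is just a Gaussian fitted to a small sample, and each generation is trained on a mix of the previous model's output and fresh draws from the true distribution. The names `fit` and `simulate`, the 20-sample training set, and the 25% fresh-data share are all assumptions chosen for illustration.

```python
import math
import random


def fit(samples):
    """Fit a Gaussian: return (mean, standard deviation) of the samples."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / (n - 1)  # unbiased variance
    return mu, math.sqrt(var)


def simulate(generations, n, fresh_fraction, seed=0):
    """Refit the 'model' each generation; return the final fitted sigma.

    fresh_fraction is the (assumed) share of each training set drawn from
    the true distribution N(0, 1); the rest is the previous model's output.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution
    n_fresh = round(n * fresh_fraction)
    for _ in range(generations):
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_fresh)]
        mu, sigma = fit(fresh + synthetic)
    return sigma


if __name__ == "__main__":
    print("own output only:", simulate(2000, 20, 0.0))
    print("25% fresh data: ", simulate(2000, 20, 0.25))
```

With training sets this small, the run that only ever sees its own output should end with a fitted sigma that has shrunk toward zero, while the run that keeps receiving fresh data should stay in the neighborhood of 1. It is only an analogy and says nothing about how much fresh data a real LLM would need.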

I suspect it's overblown today. Hopefully it'll be overblown indefinitely.

However, if AIs become as successful as Nvidia's stock price implies, it could indeed become difficult to find text that is guaranteed not to be AI-generated. It is conceivable that in 20 years it will be very difficult to assemble a training set at any scale that isn't 90% already touched by AIs.

Of course, it's conceivable that in 20 years we'll have AIs that don't need the equivalent of millennia of training to come up to their full potential. The problem is much more tractable if one merely needs to produce megabytes of training data to obtain a decent understanding of English rather than many gigabytes.

Can you show me a mathematical reason that cannot, philosophically, be applied to people as well? People are only ever fed other people's output.

I'd go with "no", because people just consuming the output of other people is a big ongoing problem. Input from the universe needs to be added in order to maintain alignment with the universe, for whichever "universe" you are considering. Without frequent reference to reality, people feeding too much on people will inevitably depart from reality.

In another context, you may know this as an "echo chamber". Not quite exactly the same concept, but very, very similar.

I do like to remind people that the AI of today and LLMs are not the whole of reality. Perhaps someday there will be AIs that are also capable of directly consulting the universe, through some sort of body they can use. But the current LLMs, which are trained on some sort of human output, need to exclude AI-generated input or they too will converge on some sort of degenerate attractor.

Yep, then we are back at "vaguely cynical hand-wringing about how bad AI is."

Currently we have mostly LLMs in the mix, but there is no reason the AI mix will not come to include embodied agents that also publish stuff on the internet (think search and rescue bots that automatically write a report).

Now AI is connected to reality without people in the mix.

When trying to close a rhetorical trap on someone, it is useful to first be sure they stepped in it.