“It’s been a whole year or two and nothing bad has happened, checkmate doomers!”
It’s pretty shocking how much web content, and how many forum posts, are either partially or completely LLM-generated these days. I’m pretty sure feeding this stuff back into models is widely understood to not be a good thing.
There have been symptoms of it, such as the colloquially named "piss filter" and the anime mole nose problem, but so far they've been symptoms rather than a fatal expression of a disease. That they are symptoms, however, shows they can become terminal if exploited properly and profusely. So far we haven't seen anyone capable of the "profusely" part.
I don't really understand why model collapse would happen.
I understand that if I have an AI model and then feed it its own responses, it will degrade in performance. But that's not what's happening in the wild: there are extra filtering steps in between. Users upvote and downvote posts, people post the "best" AI generated content (that they prefer), the more human sounding AI gets more engagement etc. All of these things filter AI output, so it's not the same thing as:
AI out -> AI in
It is:
AI out -> human filter -> AI in
And at that point the human filter starts acting like a fitness function for a genetic algorithm. Can anyone explain how this still leads to model collapse? Does the signal in the synthetic data just overpower the human filter?
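To make the question concrete, here is a toy sketch of the two loops, not a model of real LLM training. Everything in it is an assumption for illustration: the "model" is just a Gaussian fitted to its training data, `run_loop` and `keep_fraction` are hypothetical names, and the "human filter" is crudely approximated as keeping the samples closest to the fitted mean (i.e. the most typical outputs). Under that particular assumption, the filter acts as selection pressure that narrows the distribution rather than restoring lost diversity; a filter that rewards novelty could behave very differently.

```python
import random
import statistics


def run_loop(generations, n_samples, keep_fraction, seed=0):
    """Fit a Gaussian to data, sample from the fit, filter, repeat.

    keep_fraction=1.0 means no filter (AI out -> AI in).
    keep_fraction<1.0 keeps only the samples closest to the fitted mean,
    a hypothetical stand-in for a human filter that favours typical output
    (AI out -> human filter -> AI in).
    Returns the fitted standard deviation of each generation.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "human" data drawn from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stds = []
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        stds.append(sigma)
        # "Generate" content from the fitted model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # "Filter": rank by closeness to the mean and keep the top fraction.
        samples.sort(key=lambda x: abs(x - mu))
        keep = max(2, int(n_samples * keep_fraction))
        # The survivors become the next generation's training data.
        data = samples[:keep]
    return stds


unfiltered = run_loop(generations=10, n_samples=200, keep_fraction=1.0)
filtered = run_loop(generations=10, n_samples=200, keep_fraction=0.5)
print("final std, no filter:  ", unfiltered[-1])
print("final std, with filter:", filtered[-1])
```

In this toy, the spread of the filtered loop shrinks far faster than the unfiltered one, because each round of selection throws away the tails. Whether real upvotes and engagement behave more like this or more like a diversity-preserving fitness function is exactly the open part of the question.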
> Users upvote and downvote posts, people post the "best" AI generated content (that they prefer), the more human sounding AI gets more engagement etc. All of these things filter AI output
At the same time, though, AI-generated content can be produced much faster than human-generated content, so eventually AI slop drowns out everything else. You only have to check the popular social media platforms to see this in action: AI-generated posts are widely promoted and pushed on users, the same way most web searches return results with AI-generated pages ranked highly.
Humans can't keep up and companies are actively working to bypass the human filter and intentionally promote AI generated content.