Hacker News
by platinumrad 54 days ago
People love harping on this one, but model collapse hasn't turned out to be an issue in practice.
7 comments

“It’s been a whole year or two and nothing bad has happened, checkmate doomers!”

It’s pretty shocking how much web content, and how many forum posts, are either partially or completely LLM-generated these days. I’m pretty sure feeding this stuff back into models is widely understood to not be a good thing.

What do you imagine distillation being then?
It feels like if it does happen, it will take a lot longer to show up. Also, I doubt they would ship a model that turns out this corrupted stuff.

It won't mean we see the model collapse in public; more likely we struggle to get to the next quality increase.

There have been symptoms of it, such as the colloquially named "piss filter" and the anime mole nose problem, but so far they've been symptoms rather than a fatal expression of a disease. That they are symptoms, however, shows they can be terminal if exploited properly and profusely. So far we haven't seen anyone capable of the "profusely" part.
Besides, models get distilled for fun and profit all the time, which on its own does not support the theory of model collapse.
It doesn't seem like anything has changed to preclude it as a possible outcome yet.
I don't really understand why model collapse would happen.

I understand that if I have an AI model and then feed it its own responses it will degrade in performance. But that's not what's happening in the wild: there are extra filtering steps in between. Users upvote and downvote posts, people post the "best" AI generated content (that they prefer), the more human sounding AI gets more engagement etc. All of these things filter AI output, so it's not the same thing as:

AI out -> AI in

It is:

AI out -> human filter -> AI in

And at that point the human filter starts acting like a fitness function for a genetic algorithm. Can anyone explain how this still leads to model collapse? Does the signal in the synthetic data just overpower the human filter?
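
For anyone who wants to poke at the two pipelines above, here is a minimal toy sketch. It assumes the "model" is just a one-dimensional Gaussian that gets refit on its own samples each generation, and that the "human filter" keeps the samples closest to a preferred value. The names `run_loop`, `keep_fraction` and `target` are made up for illustration, and none of this says anything definitive about real LLM training.

```python
import random
import statistics


def run_loop(generations=200, sample_size=10, keep_fraction=1.0,
             target=1.0, seed=0):
    """Refit a Gaussian "model" on its own samples, generation after generation.

    keep_fraction=1.0 is the unfiltered loop (AI out -> AI in).
    keep_fraction<1.0 keeps only the samples closest to `target`, a
    hypothetical stand-in for human preference
    (AI out -> human filter -> AI in).
    Returns a list of (mean, stdev) pairs, one per generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: assumed original data distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # "AI out": the current model generates content.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Human filter": rank by distance to the preferred value, keep the
        # best ones (at least 2, so the refit still has a spread to measure).
        samples.sort(key=lambda x: abs(x - target))
        kept = samples[:max(2, int(sample_size * keep_fraction))]
        # "AI in": maximum-likelihood refit on whatever survived the filter.
        mu, sigma = statistics.fmean(kept), statistics.pstdev(kept)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    for label, frac in [("unfiltered", 1.0), ("filtered", 0.5)]:
        mu, sigma = run_loop(keep_fraction=frac)[-1]
        print(f"{label}: mean={mu:.3f} stdev={sigma:.3g}")
```

In this toy setup the filter can steer where the distribution ends up, but nothing in the loop adds back the spread that is lost by refitting on a small sample. Whether real pipelines, with fresh human data mixed in, behave the same way is exactly the open question.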

> Users upvote and downvote posts, people post the "best" AI generated content (that they prefer), the more human sounding AI gets more engagement etc. All of these things filter AI output

At the same time, though, AI generated content can be produced much, much faster than human generated content, so eventually AI slop drowns out everything else. You only have to check the popular social media platforms to see this in action: AI generated posts are widely promoted and pushed on users, the same way most web searches return AI generated pages ranked highly.

Humans can't keep up and companies are actively working to bypass the human filter and intentionally promote AI generated content.

The past is not a good predictor of future performance.