by jerf
805 days ago
The concern is not just vaguely cynical hand-wringing about how bad AI is. Feeding AIs their own output as training material is bad for mathematical reasons, and feeding AIs the output of other, very similar AIs is close enough to be bad too. The reasons are subtle and hard to describe in plain English, and I'm not enough of an expert to even try, so pardon me if I don't. But given that it is hard to determine whether output came from an AI, AI really does face a crisis: good training material will be hard to come by in the future.
|
Most model collapse studies explore degenerate cases to determine the potential limits of repeatedly training the same model on its own output. No wonder you get terrible results if you recursively recompress a JPEG 100 times! In the real world it's nowhere near that bad, because models are never trained on their own output alone and are always guaranteed to receive a certain amount of external data, starting with manual dataset curation (yes, that's also fresh data in itself).
Meanwhile, synthetic datasets are entirely commonplace. I suspect this is a non-issue, blown way out of proportion by people misinterpreting these studies.
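
For anyone who wants to poke at the two positions above, here is a toy sketch of the degenerate case versus the mixed case. It is not taken from any of the studies: the "model" is just a Gaussian fitted to its training set, and the function name `simulate` and its parameters (`sample_size`, `fresh_fraction`, `seed`) are made up for illustration. Each generation is trained on samples from the previous generation's fit, optionally mixed with fresh samples from the true distribution N(0, 1).

```python
import random
import statistics


def simulate(generations, sample_size, fresh_fraction, seed=0):
    """Refit a Gaussian, generation after generation, on its own samples.

    fresh_fraction is the share of each training set drawn from the true
    distribution N(0, 1); the remainder is sampled from the previous
    generation's fitted Gaussian. Returns the fitted standard deviation
    after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution
    n_fresh = round(sample_size * fresh_fraction)
    history = []
    for _ in range(generations):
        # Synthetic data: output of the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_fresh)]
        # Fresh data: new samples from the real world (here, N(0, 1)).
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
        # "Training" is a maximum-likelihood Gaussian fit.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data, mu)
        history.append(sigma)
    return history


# Degenerate case: every generation sees only the previous model's output.
collapsed = simulate(generations=1000, sample_size=50, fresh_fraction=0.0)
# Mixed case: half of every training set is fresh real data.
mixed = simulate(generations=1000, sample_size=50, fresh_fraction=0.5)

print("final sigma, self-trained only:", collapsed[-1])
print("final sigma, 50% fresh data:   ", mixed[-1])
```

In this toy setup the self-trained-only run loses its spread (the fitted sigma drifts toward zero), while the run that keeps receiving fresh data stays close to the true spread. That says nothing quantitative about real LLM training; it only shows why the pure-recursion result and the "it's fine in practice" observation are not in conflict.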