Hacker News
by orbital-decay 803 days ago
>Feeding AIs their own output as training material is a bad thing for mathematical reasons

Most model collapse studies explore degenerate cases to determine the potential limits of recursively training the same model. No wonder you get terrible results if you recursively recompress a JPEG 100 times! In the real world it's nowhere near that bad, because models are never trained on their own output alone and are always guaranteed to receive a certain amount of external data, starting with manual dataset curation (yes, that's also fresh data in itself).
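A toy sketch of that difference, not taken from any of the studies: fitting a Gaussian stands in for "training", and each generation is fitted to samples from the previous one, optionally mixed with samples from the real distribution. The function name `final_sigma`, the `fresh_fraction` parameter, and all the numbers are assumptions made up for illustration.

```python
import math
import random


def final_sigma(generations, n_samples, fresh_fraction, seed=0):
    """Refit a Gaussian each generation; return the last fitted std dev.

    Hypothetical toy setup: "real" data is N(0, 1), and "training" is just
    estimating a mean and a standard deviation from the samples.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_fresh = round(n_samples * fresh_fraction)
    n_synthetic = n_samples - n_fresh
    for _ in range(generations):
        # Samples drawn from the previous generation's fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(n_synthetic)]
        # External samples drawn from the real distribution.
        data += [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        # "Train" the next generation: maximum-likelihood Gaussian fit.
        mu = sum(data) / n_samples
        sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n_samples)
    return sigma


if __name__ == "__main__":
    for fraction in (0.0, 0.2):
        s = final_sigma(generations=2000, n_samples=100, fresh_fraction=fraction)
        print(f"fresh_fraction={fraction}: final sigma = {s:.3g}")
```

In this toy, the pure self-training run should shrink sigma toward zero (the JPEG-recompression case), while the run with 20% fresh data should stay near the real value of 1. It says nothing about how much fresh data an actual model would need.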

Meanwhile, synthetic datasets are entirely common. I suspect this is a non-issue that is way overblown by people misinterpreting these studies.

1 comment

I suspect it's overblown today. Hopefully it'll be overblown indefinitely.

However, if AIs become as successful as Nvidia's stock price implies, it could indeed become difficult to find text that is guaranteed not to be AI-generated. It is conceivable that in 20 years it will be very difficult to assemble a training set at any scale that isn't 90% already touched by AIs.

Of course, it's conceivable that in 20 years we'll have AIs that don't need the equivalent of millennia of training to come up to their full potential. The problem is much more tractable if one merely needs to produce megabytes of training data to obtain a decent understanding of English rather than many gigabytes.