
Training on AI-generated data isn't a problem, and everyone has been doing it routinely for 18+ months.

The issue is training on 'indiscriminate' AI-generated data, which just leads to more and more degenerate results. No one is doing this, however; there is always some kind of filtering to select which generated data to use for training. So the findings of that paper are not at all surprising and, frankly, intuitive and already well known.
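
To make "some kind of filtering" concrete, here is a minimal sketch of the idea. It is not how any particular lab does it: `select_for_training`, `toy_score`, and the 0.7 threshold are all hypothetical, and a real pipeline would presumably use something much stronger as the scorer (a reward model, a classifier, unit tests, human review).

```python
from typing import Callable, Iterable, List


def select_for_training(
    samples: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.7,  # assumed cutoff, purely illustrative
) -> List[str]:
    """Keep only the generated samples whose score meets the threshold."""
    return [s for s in samples if score(s) >= threshold]


def toy_score(text: str) -> float:
    """Hypothetical scorer: the fraction of distinct words in the text.

    Repetitive or empty output scores low. This is a stand-in for a real
    quality signal, not a recommendation.
    """
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


generated = [
    "the cat sat on a mat",
    "the the the the the the",  # degenerate, repetitive output
    "",                          # empty generation
    "filtered synthetic data can still be useful",
]

# Only the samples that pass the filter would go into the training set.
kept = select_for_training(generated, toy_score)
print(kept)
```

The 'indiscriminate' case is what you get by skipping the `select_for_training` step and training on everything in `generated`.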