
by dinfinity 322 days ago

Let's remember that the basic defense against model collapse is just not training on AI-generated and other crap data.

Sure, there are places where determining whether data is AI-generated or 'real' is hard, but there are plenty of places where trust in the provider is enough of a basis to include the data during curation. For example, it's not as if the NYT will suddenly start pumping out unchecked AI slop.
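To make the curation idea concrete, here is a minimal sketch of a provenance filter that keeps only documents from trusted providers. Everything in it is an assumption for illustration: the `Doc` type, the `TRUSTED_DOMAINS` allowlist, and the `is_trusted` and `curate` helpers are hypothetical names, and a real pipeline would need far more than a domain check.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical allowlist of providers whose editorial process we trust.
TRUSTED_DOMAINS = {"nytimes.com"}


@dataclass
class Doc:
    url: str
    text: str


def is_trusted(doc: Doc) -> bool:
    """Return True if the document's host is an allowlisted domain or a subdomain of one."""
    host = urlparse(doc.url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def curate(docs):
    """Keep only documents whose provenance passes the trust check."""
    return [d for d in docs if is_trusted(d)]
```

Matching on the exact domain or a dot-prefixed suffix means a lookalike host such as `fakenytimes.com` does not pass.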

And then there is the enormous potential of data that is synthesized with the aid of AI, but not completely generated by it, and validated for accuracy through systematic means.
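One way to picture "validated through systematic means" is a toy case where every candidate example can be checked mechanically. The sketch below assumes the candidates are simple arithmetic question/answer pairs; the `candidates` list is made-up stand-in data (not real model output), and `evaluate` and `validate` are hypothetical helpers. Only pairs whose answer survives independent recomputation are kept.

```python
import ast
import operator

# Stand-in for AI-proposed (question, answer) pairs; illustrative data only.
candidates = [("2 + 3", 5), ("7 * 6", 42), ("10 - 4", 7)]

# The only operators this toy checker accepts.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expr: str) -> int:
    """Recompute a single binary expression on literals, without using eval()."""
    node = ast.parse(expr, mode="eval").body
    if not (isinstance(node, ast.BinOp) and type(node.op) in OPS):
        raise ValueError(f"unsupported expression: {expr!r}")
    left, right = node.left, node.right
    if not (isinstance(left, ast.Constant) and isinstance(right, ast.Constant)):
        raise ValueError(f"unsupported operands: {expr!r}")
    return OPS[type(node.op)](left.value, right.value)


def validate(pairs):
    """Keep only pairs whose stated answer matches the recomputed one."""
    return [(q, a) for q, a in pairs if evaluate(q) == a]
```

Here the wrong pair `("10 - 4", 7)` is dropped and the other two are kept. Real domains rarely have a checker this clean, which is where the hard part of the work lies.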