Hacker News new | ask | show | jobs
by Aedelon 121 days ago
Survey of 65+ papers on model collapse. Key finding from Dohmatob et al. (ICLR 2025): even 0.1% synthetic contamination in training data causes measurable degradation.

No major dataset (FineWeb, RedPajama, C4) currently filters for AI-generated content.