Hacker News
by asdff 946 days ago
It wouldn't surprise me if the vast bulk of what they trained on from Common Crawl was algorithmically written spam anyhow. Not to mention that, at this point, potential datasets going forward are all going to be polluted with AI-generated content, entrenching the bias in these training sets. I wouldn't be surprised if a certain percentage of HN comments now come from people testing language models. Certainly Reddit and other popular websites have been polluted for years, even before the latest crop of GPTs.