by asdff
946 days ago
It wouldn't surprise me if the vast bulk of what they trained on from Common Crawl was algorithmically written spam anyhow. Not to mention that any datasets collected from this point on are going to be polluted with AI-generated content, entrenching the bias in these training sets. I wouldn't be surprised if a certain percentage of HN comments are now from people testing language models. Certainly Reddit and other popular websites have been polluted for years, even before the latest crop of GPTs.