Hacker News
by alchemist1e9 1289 days ago
This was likely a significant percentage of the input data:

https://en.wikipedia.org/wiki/Common_Crawl

I also have an odd hunch that ChatGPT might have used a Sci-Hub mirror as input, for example.