https://en.wikipedia.org/wiki/Common_Crawl
I also have an odd hunch that ChatGPT might have used a Sci-Hub mirror as input, for example.