by afh1
665 days ago
Interesting read; I did not know about Common Crawl. I feel like RTBF is kind of a lost battle these days, with more and more crawlers for AI and whatnot. Once something is on the internet there is no way back, for better or for worse. This tangent aside, 8TB is really not a lot of data: it's just eight consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online"; maybe it's the largest public one. I'm also not sure how representative it is of "the whole internet".
|
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.