by buildbot
673 days ago
No torrents at all in this data; it's all publicly available/open access. Mostly scientific PDFs, and a good portion of those are scans, not just text, so the actual amount of text is probably pretty low compared to the total. But still, there's a lot more than 8TB of raw data out there. I bet the total volume of PDFs is close to a petabyte, if not more.

That's a safe bet. I've seen PDFs in the GBs from users treating PDF like a container format (which it is).