Hacker News
by tokai 672 days ago
Yeah, 8TB is really tiny. Google Scholar was estimated to index 160,000,000 PDFs in 2015.[0] If we assume that a third of those are not behind paywalls, and the average PDF size is 1MB, it ends up as something above 50TB of documents. Almost ten years later, the number of available PDFs of just scholarly communication should be substantially higher.

[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
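
For anyone who wants to check the back-of-envelope math, here is a rough sketch. The one-third open-access share and the 1MB average size are just the guesses from the comment, not measured values, and the names in the code are made up for illustration.

```python
# Back-of-envelope estimate using the figures from the comment above.
PDFS_INDEXED = 160_000_000   # Google Scholar estimate for 2015 [0]
OPEN_FRACTION = 1 / 3        # assumption: a third are not behind paywalls
AVG_PDF_BYTES = 1_000_000    # assumption: 1MB per PDF (decimal megabyte)


def estimate_tb(n_pdfs, open_fraction, avg_bytes):
    """Return the estimated corpus size in decimal terabytes."""
    return n_pdfs * open_fraction * avg_bytes / 1e12


size_tb = estimate_tb(PDFS_INDEXED, OPEN_FRACTION, AVG_PDF_BYTES)
print(f"{size_tb:.1f} TB")
```

Changing either assumption moves the result proportionally, so a 2MB average would double the estimate.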


Anna's Archive has some 300M PDFs.
We're talking about the open web here. But yeah, that's the point: the dataset is unreasonably small.