by SnowflakeOnIce 664 days ago
Common Crawl only pulls documents up to a small size limit (1 MiB last I checked). Without special handling in this project, documents bigger than that would be missing.

So indeed, not representative of the whole Internet.

1 comment

From the article:

>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.

This is where SafeDocs, or CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. This corpus was originally created by the DARPA SafeDocs program, which refetched all the PDFs from a Common Crawl snapshot to obtain untruncated versions of them.
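For anyone wondering how you would even pick out the truncated PDFs before refetching, here is a rough sketch. It is not how SafeDocs did it; the 1 MiB cap is just the figure quoted in this thread, and `looks_truncated` and `needs_refetch` are hypothetical helper names. The heuristic assumes a complete PDF ends with an `%%EOF` marker, so a stored payload that fills the cap and lacks that marker near its end is probably cut off.

```python
# Assumption: the 1 MiB cap is the figure quoted in this thread, not a verified constant.
TRUNCATION_LIMIT = 1024 * 1024


def looks_truncated(payload: bytes, limit: int = TRUNCATION_LIMIT) -> bool:
    """Heuristic check: the payload reaches the cap and has no PDF end marker."""
    hit_cap = len(payload) >= limit
    # A complete PDF normally carries an %%EOF marker close to the end of the file.
    has_eof_marker = b"%%EOF" in payload[-1024:]
    return hit_cap and not has_eof_marker


def needs_refetch(records):
    """Return the URLs of (url, payload) pairs whose stored bytes look truncated."""
    return [url for url, payload in records if looks_truncated(payload)]
```

Feeding it `(url, payload)` pairs gives back the URLs worth downloading again from the original host. It will miss odd cases (for example a PDF with trailing junk after `%%EOF`), so treat it as a first-pass filter only.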