by SnowflakeOnIce
664 days ago
Common Crawl only fetches documents up to a small size limit (1 MiB last I checked). Without special handling in this project, documents bigger than that would be missing. So indeed, not representative of the whole Internet.
>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.
This is where SafeDocs, or CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. This corpus was originally created by the DARPA SafeDocs program, which refetched all the PDFs from a Common Crawl snapshot to obtain untruncated versions of them.
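
For anyone who wants to check their own Common Crawl PDFs, here is a rough sketch of how you might flag a record as probably truncated. It is only a heuristic, not how SafeDocs or Common Crawl actually do it: the function name `looks_truncated` and the 1 MiB default are my own assumptions, and it relies on the fact that a complete PDF normally ends with an `%%EOF` marker near the end of the file.

```python
# Heuristic sketch (hypothetical helper, not part of any Common Crawl tooling).
# Assumes the crawler cut the payload at `limit` bytes, 1 MiB by default.
LIMIT = 1024 * 1024


def looks_truncated(data: bytes, limit: int = LIMIT) -> bool:
    """Return True if the payload hit the size limit and lacks a PDF trailer."""
    hit_limit = len(data) >= limit
    # A complete PDF normally has "%%EOF" within its last kilobyte or so.
    has_eof_marker = b"%%EOF" in data[-1024:]
    return hit_limit and not has_eof_marker


if __name__ == "__main__":
    complete = b"%PDF-1.4\n" + b"x" * 100 + b"\n%%EOF\n"
    cut_off = (b"%PDF-1.4\n" + b"x" * LIMIT)[:LIMIT]
    print(looks_truncated(complete))
    print(looks_truncated(cut_off))
```

Requiring both conditions keeps small but malformed PDFs from being flagged, since those were broken at the source rather than cut off by the crawler.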