by ziddoap
672 days ago
From the article:

> Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest. This is where SafeDocs or CC-MAIN-2021-31-PDF-UNTRUNCATED enters the picture. This corpus was originally created by the DARPA SafeDocs program and what it did was refetch all the different pdfs from a snapshot of Common Crawl to have untruncated versions of them.
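
To make the truncation concrete, here is a rough sketch of how one might guess whether a stored PDF payload was cut off at the limit. It is only a heuristic and not how Common Crawl or SafeDocs do it: the function name `looks_truncated` and the constant `ONE_MIB` are made up for illustration, the 1 MiB cutoff is assumed from "the first megabyte" in the quote, and it relies on complete PDFs normally ending with a `%%EOF` marker.

```python
# Assumed cutoff, taken from "the first megabyte" in the quoted article.
ONE_MIB = 1024 * 1024


def looks_truncated(payload: bytes, limit: int = ONE_MIB) -> bool:
    """Heuristic guess: the payload reached the size cap and has no
    PDF end-of-file marker in its last kilobyte."""
    hit_limit = len(payload) >= limit
    # A complete PDF normally ends with a "%%EOF" line near the end.
    has_eof_marker = b"%%EOF" in payload[-1024:]
    return hit_limit and not has_eof_marker
```

A payload flagged this way would be a candidate for refetching from the original URL, which is roughly the gap the untruncated corpus is described as filling.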