by ks2048
700 days ago
It looks like it contains data from CommonCrawl and ArXiv. It's not clear what kind of processing they did, but sometimes these releases seem like just repackaging existing datasets with your own name on them. It's not hard to get bulk downloads from these sources directly. I thought CommonCrawl truncated files at 1MB. I wonder if the PDFs for CommonCrawl were re-fetched from the URLs. That could be useful if they provide a simple way to get those full files.

We do a lot of pre-processing of CommonCrawl (which in its raw form isn’t all that useful for training models). This includes heuristics to remove low-quality text and images, and deduplication of documents, paragraphs, and images. All of these are crucial to achieving good training performance.
On your point regarding PDFs, we actually don’t constrain ourselves to the truncated 1MB files and do our own downloading of PDFs!
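
To make the filtering and paragraph-level deduplication steps a bit more concrete, here is a minimal Python sketch. It is illustrative only and not the actual pipeline described above: the function names, the thresholds (`min_words`, `min_alpha_ratio`), and the use of exact SHA-1 hashing on normalized text are all assumptions. Real pipelines often use fuzzy methods such as MinHash instead of exact matching.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def looks_low_quality(paragraph: str, min_words: int = 5,
                      min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: too few words or too few alphabetic characters."""
    if len(paragraph.split()) < min_words:
        return True
    alpha = sum(ch.isalpha() for ch in paragraph)
    return alpha / len(paragraph) < min_alpha_ratio


def clean_documents(documents):
    """Drop low-quality paragraphs and paragraphs already seen in any document."""
    seen = set()      # hashes of every paragraph kept so far, across documents
    cleaned = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            norm = normalize(paragraph)
            if not norm or looks_low_quality(norm):
                continue
            digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate after normalization
            seen.add(digest)
            kept.append(paragraph.strip())
        if kept:          # documents with nothing left are dropped entirely
            cleaned.append("\n\n".join(kept))
    return cleaned
```

Keeping one `seen` set across all documents is what makes this cross-document deduplication; storing hashes rather than the text keeps memory use bounded per paragraph.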