| HN Mirror



	by ks2048 700 days ago
	It looks like it contains data from CommonCrawl and ArXiv. It's not clear what kind of processing they did, but sometimes these releases seem like just repackaging existing datasets with your own name on them. It's not hard to get bulk downloads from these sources directly. I thought CommonCrawl truncated files at 1MB. I wonder if the PDFs for CommonCrawl were re-fetched from the URLs. That could be useful if they provide a simple way to get those full files.


anas-awadalla 700 days ago

Hello! Creator of MINT here.

We do a lot of pre-processing of CommonCrawl (which in its raw form isn’t all that useful for training models). This includes heuristics to remove low-quality text and images, and deduplication of documents, paragraphs, and images. All of these are crucial to achieving good training performance.
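For readers wondering what paragraph-level deduplication can look like, here is a minimal sketch. It is not MINT's actual pipeline, which the thread does not describe: it assumes exact matching after whitespace and case normalization, and the names `normalize` and `dedupe_paragraphs` are made up for illustration.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def dedupe_paragraphs(documents):
    """Drop any paragraph already seen earlier in the corpus.

    Assumption: paragraphs are separated by blank lines, and "duplicate"
    means identical text after normalize(). Documents left with no
    paragraphs are dropped entirely.
    """
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            norm = normalize(para)
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already emitted by this or an earlier document
            seen.add(digest)
            kept.append(para.strip())
        if kept:
            result.append("\n\n".join(kept))
    return result
```

Exact hashing only catches verbatim repeats such as shared footers; near-duplicate detection (for example MinHash) would be a different technique and is not shown here.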

On your point regarding PDFs, we actually don’t constrain ourselves to the 1MB truncated files and do our own downloading of PDFs!


ks2048 700 days ago

I see. Thanks for the reply. I opened one of the tar files and now see that the extracted text is stored in JSON files.
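In case it helps anyone else poking at the archives, a rough sketch of reading JSON members out of a tar file with Python's standard library follows. The thread does not give the layout of the MINT shards, so the `.json` suffix check and the helper name `iter_json_from_tar` are assumptions.

```python
import json
import os
import tarfile


def iter_json_from_tar(source):
    """Yield (member_name, parsed_json) for each .json file in a tar archive.

    `source` may be a filesystem path or a seekable binary file object.
    Assumption: each .json member holds one JSON document.
    """
    if isinstance(source, (str, os.PathLike)):
        kwargs = {"name": source}
    else:
        kwargs = {"fileobj": source}
    # "r:*" lets tarfile detect compression (none, gzip, bz2, xz) itself.
    with tarfile.open(mode="r:*", **kwargs) as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            with archive.extractfile(member) as handle:
                yield member.name, json.load(handle)
```

Iterating member by member avoids unpacking the whole archive to disk, which matters when shards are large.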
