by anas-awadalla 700 days ago

Hello! Creator of MINT here.

We do a lot of pre-processing of Common Crawl (which in its raw form isn’t all that useful for training models). This includes heuristics to remove low-quality text and images, and deduplication of documents, paragraphs, and images. All of these are crucial to achieving good training performance.
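To make the filtering and deduplication idea concrete, here is a minimal Python sketch of paragraph-level cleaning. It is only an illustration, not the actual MINT pipeline: the function names, the thresholds (`min_words`, `min_alpha_ratio`), and the choice of exact-match hashing over normalized text are all assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_low_quality(paragraph: str, min_words: int = 5,
                   min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: too few words, or too few alphabetic characters."""
    words = paragraph.split()
    if len(words) < min_words:
        return True
    alpha = sum(ch.isalpha() for ch in paragraph)
    return alpha / len(paragraph) < min_alpha_ratio


def clean_documents(documents):
    """Drop low-quality paragraphs and exact duplicates across all documents."""
    seen = set()      # hashes of paragraphs already kept
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            if not para.strip() or is_low_quality(para):
                continue
            digest = hashlib.sha256(normalize(para).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of a paragraph seen earlier
            seen.add(digest)
            kept.append(para.strip())
        if kept:  # documents with nothing left are dropped entirely
            cleaned.append("\n\n".join(kept))
    return cleaned
```

Exact hashing only catches identical paragraphs after normalization; near-duplicate detection (for example MinHash) would be a separate technique and is not shown here.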

On your point regarding PDFs, we actually don’t constrain ourselves to the 1MB files and do our own downloading of PDFs!

1 comment

ks2048 700 days ago

I see. Thanks for the reply. I opened one of the tar files and see now how it has extracted the text into JSON files.
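For anyone else wanting to inspect a shard the same way, a rough sketch using only the Python standard library might look like the following. The `.json` suffix check and the function name `iter_json_records` are assumptions for illustration; the actual file layout and record schema inside the dataset's tar files are not described in this thread.

```python
import json
import tarfile


def iter_json_records(tar_path):
    """Yield (member_name, parsed_json) for every .json file in a tar archive."""
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                yield member.name, json.load(handle)
```

Iterating over the archive this way reads members one at a time, so a large shard does not have to be unpacked to disk first.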
