by anas-awadalla
700 days ago
Hello! Creator of MINT here. We do a lot of pre-processing of Common Crawl (which in its raw form isn’t all that useful for training models). This includes heuristics to remove low-quality text and images, and deduplication of documents, paragraphs, and images. All of these are crucial for good training performance. On your point regarding PDFs, we actually don’t constrain ourselves to the 1MB files and do our own downloading of PDFs!
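
To make the paragraph-level deduplication step concrete, here is a rough sketch of exact-match dedup by hashing normalized paragraphs. This is only an illustration, not the actual MINT pipeline: the function names, the whitespace/lowercase normalization, the blank-line paragraph split, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def normalize(paragraph: str) -> str:
    # Assumed normalization: collapse whitespace and lowercase, so trivially
    # different copies of the same paragraph hash to the same value.
    return " ".join(paragraph.split()).lower()


def dedup_paragraphs(documents: list[str]) -> list[str]:
    """Drop any paragraph whose normalized text already appeared earlier."""
    seen: set[str] = set()
    result: list[str] = []
    for doc in documents:
        kept = []
        # Assumption: paragraphs are separated by blank lines.
        for para in doc.split("\n\n"):
            norm = normalize(para)
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate after normalization
            seen.add(digest)
            kept.append(para.strip())
        # Documents left with no paragraphs are dropped entirely.
        if kept:
            result.append("\n\n".join(kept))
    return result
```

Exact hashing like this only catches verbatim repeats (after normalization); near-duplicate detection would need a fuzzier technique and is outside this sketch.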