Hacker News
by mnsu 496 days ago
For multi-modal models, why not? They would probably be some of the best data.
1 comment

Sometimes the PDF of a book is big because the book's packed with important illustrations and charts - like a textbook or journal paper.

Other times a PDF of a book is big because someone scanned it and didn't have trustworthy OCR, so they figured distributing images of text at 1.5 MB per page was better than risking OCR errors.