Hacker News
by zffr 335 days ago
PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.

For the other formats you mentioned, I agree that it is probably better to parse the document instead.

3 comments

> PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

Yeah, but when they do, it makes a difference.

Also, speaking from experience, most invoices do contain actual text.
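
A minimal sketch of that hybrid idea, assuming you already have whatever text your PDF library extracted for each page: use the text layer when a page has one, and fall back to rendering the page as an image when it doesn't. The function name `pick_strategy` and the `min_chars` threshold are made up for illustration, and the threshold value is a guess you would tune on your own documents.

```python
from typing import List


def pick_strategy(page_texts: List[str], min_chars: int = 20) -> List[str]:
    """Return "text" or "image" for each page.

    page_texts holds the text extracted from each page by some PDF
    library (hypothetical input; the extraction step is not shown).
    min_chars is an assumed cutoff, not a recommended value.
    """
    strategies = []
    for text in page_texts:
        # Count only non-whitespace characters, so a page that yields
        # nothing but spaces and newlines is treated as having no text.
        visible = sum(1 for ch in text if not ch.isspace())
        strategies.append("text" if visible >= min_chars else "image")
    return strategies
```

Pages marked "image" would then go through the render-and-OCR (or vision model) path, and the rest can be parsed directly.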

Completely agree with this. This is what we've observed in production too. Embedding images makes a RAG pipeline a lot more robust to the "inner workings" of a document.

The more I learn about PDF, the more I'm like: what?

It makes sense. If you "print" to PDF, it makes far more sense to keep the vector representation around. Rasterizing it would both bloat the file size and lower the quality when the page is scaled or otherwise transformed.