| HN Mirror


by zffr 335 days ago

PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.
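To make the distinction concrete, here is a rough sketch of how one might tell the two cases apart. It assumes you already have a page's decoded (uncompressed) content stream as bytes; the function name and the sample streams are made up for illustration. It only looks for the standard text-showing operators (`Tj`, `TJ`, `'`, `"`) versus path-painting operators, so it is a heuristic: it ignores Form XObjects, invisible text render modes, and fonts with a broken Unicode mapping.

```python
import re

# String operands, e.g. (Invoice #42) or <48656C6C6F>. They are stripped so
# that operator-like bytes inside a string are not mistaken for operators.
# Simplification: nested unescaped parentheses are not handled.
STRING_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>", re.DOTALL)

# Operators that show text.
TEXT_SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}
# Operators that stroke or fill a path (how outlined glyphs get drawn).
PATH_PAINT_OPS = {b"f", b"F", b"f*", b"S", b"s", b"B", b"B*", b"b", b"b*"}


def classify_content_stream(stream: bytes) -> str:
    """Return 'text', 'paths-only', or 'empty' for a decoded content stream."""
    cleaned = STRING_LITERAL.sub(b" ", stream)
    tokens = set(cleaned.split())
    if tokens & TEXT_SHOW_OPS:
        return "text"        # a text extractor has something to work with
    if tokens & PATH_PAINT_OPS:
        return "paths-only"  # letters, if any, are just drawn shapes
    return "empty"


# Hypothetical sample streams: real text vs. a filled outline.
text_stream = b"BT /F1 12 Tf 72 720 Td (Invoice #42) Tj ET"
outline_stream = b"72 720 m 80 740 l 88 720 l h f"

print(classify_content_stream(text_stream))
print(classify_content_stream(outline_stream))
```

A pipeline could use a check like this to parse text when it is there and fall back to rendering the page as an image otherwise.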

For the other formats you mentioned, I agree that it is probably better to parse the document instead.

3 comments

ashishb 335 days ago

> PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

Yeah, but when they do, it makes a difference.

Also, speaking from experience, most invoices do contain actual text.


ArnavAgrawal03 335 days ago

Completely agree with this. This is what we've observed in production too. Embedding images makes the RAG a lot more robust to the "inner workings" of a document.


barrenko 335 days ago

The more I learn about PDF, the more I am like: what?


fc417fc802 335 days ago

It makes sense. If you "print" to pdf it makes far more sense to keep the vector representation around. Rasterizing it would simultaneously bloat the file size and lower the quality level when transformed.
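For a rough sense of the bloat, here is a back-of-the-envelope sketch of the uncompressed size of a rasterized page. The page size, DPI values, and 3 bytes per pixel (8-bit RGB) are assumptions picked for illustration; real PDFs compress embedded images, so actual sizes would be smaller.

```python
def raster_bytes(width_in: float, height_in: float, dpi: int,
                 bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of a page rasterized at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# US Letter page (8.5 x 11 inches), 8-bit RGB.
size_300 = raster_bytes(8.5, 11, 300)
size_600 = raster_bytes(8.5, 11, 600)

print(size_300)  # 25245000 bytes, roughly 25 MB
print(size_600)  # 100980000 bytes: doubling the DPI quadruples the size
```

The vector version stays the same size at any zoom level, while the raster is stuck at whatever resolution it was rendered at.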
