Hacker News
by ashishb 335 days ago
I speak from experience that this is a bad idea.

There are cases where documents contain text with characters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/html file, you lose information by converting it into an image.

For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.
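
To make the look-alike problem concrete, here is a minimal Python sketch. The groups in `CONFUSABLE_GROUPS` and both helper names are illustrative assumptions, not a standard list or an existing library. It flags which characters in a serial number would be ambiguous once rendered, and checks whether two different strings collapse to the same appearance.

```python
# Illustrative assumption: a tiny, non-exhaustive set of look-alike groups.
CONFUSABLE_GROUPS = [set("0O"), set("1lI")]


def ambiguous_positions(text):
    """Return (index, char) pairs for characters that belong to a look-alike group."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if any(ch in group for group in CONFUSABLE_GROUPS)
    ]


def same_when_rendered(a, b):
    """True if two strings differ at most by swaps within a look-alike group."""

    def canon(ch):
        # Map every member of a group to one fixed representative.
        for group in CONFUSABLE_GROUPS:
            if ch in group:
                return min(group)
        return ch

    return len(a) == len(b) and all(canon(x) == canon(y) for x, y in zip(a, b))
```

With the source text, `"SN0O1l"` and `"SNO01I"` are different serial numbers. After rendering to an image in a font where these glyphs match, that difference is the information that gets lost.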

3 comments

PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.

For the other formats you mentioned, I agree that it is probably better to parse the document instead.

> PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

Yeah, but when they do, it makes a difference.

Also, speaking from experience, most invoices do contain actual text.
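
One way to reconcile the two positions is to prefer the text layer when a page has one and fall back to rendering only when it doesn't. A rough Python sketch, assuming `text_layer` is whatever string your PDF library's text extraction returned for the page (the function name and the `min_chars` threshold are made up for illustration):

```python
def choose_extraction(text_layer, min_chars=20):
    """Pick 'text' when the page has a usable text layer, else 'image'.

    text_layer: string returned by some PDF text extractor (may be None or empty
    for pages that only contain drawing instructions or scanned images).
    min_chars: arbitrary threshold, an assumption to tune per corpus.
    """
    # Ignore whitespace so a page of blank lines does not count as text.
    cleaned = "".join(text_layer.split()) if text_layer else ""
    return "text" if len(cleaned) >= min_chars else "image"
```

This keeps exact characters for documents like invoices that do embed real text, and only pays the image/OCR cost for pages where there is nothing to parse.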

Completely agree with this. This is what we've observed in production too. Embedding images makes the RAG a lot more robust to the "inner workings" of a document.
The more I learn about PDF, the more I think: what?
It makes sense. If you "print" to PDF, it makes far more sense to keep the vector representation around. Rasterizing it would simultaneously bloat the file size and lower the quality when the page is scaled or transformed.
This is within the context of using it as an alternative to OCR, which would suffer the same issues, with more duct tape and string infrastructure and cost.
Strangely the linked marketing text repeatedly comments regarding OCR errors (I counted at least 4 separate instances), which is extremely weird because such a visual RAG suffers precisely the same problem. It is such a weird thing to repeatedly harp on.

If the OCR has a problem understanding varying fonts and text, there is zero reason to think that using embeddings instead is immune to this.

I’m confused. Wouldn’t the LLM be able to read the text more correctly than traditional OCR by virtue of inferring what that looks like vs what makes sense for it to look like from training? I would think it would be less prone to typographic interpretation errors than a more traditional mechanical algorithm.
Modern OCR uses machine learning, including ViT and precisely the same models and technologies used in the linked solution. I mean, if their comparison was with OCR from 2002, sure, but they're comparing modern OCR solutions, which generate text representations of documents using the very latest machine learning innovations and massive models (along with transformer-based contextual inference over the text), against their own solution, which uses precisely the same stack. It's a weird thing for them to continually harp on.

Their solution is precisely as subject to ambiguities of text as the OCR solutions they compare against.

You can win any race if you can cherry-pick your competitors.
For HTML, in a lot of cases, using the tags to chunk things works better. However, I've found that when I'm trying to design a page, showing models the actual image of the page leads to way better debugging than just sending the code back.
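
As a rough illustration of tag-based chunking, here is a sketch using only Python's standard `html.parser`. The `HeadingChunker` class and `chunk_html` function are hypothetical names, and splitting on `h1` to `h6` is just one assumed strategy; real pages usually need more care (for example, skipping `script` and `style` contents, which this sketch does not do).

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Split a page into chunks, starting a new chunk at every heading tag."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = []  # list of {"heading": str, "text": str}
        self._in_heading = False
        self._current = {"heading": "", "text": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()  # close the previous chunk before starting a new one
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        key = "heading" if self._in_heading else "text"
        self._current[key] += data

    def _flush(self):
        # Collapse whitespace and store the chunk only if it has any content.
        heading = " ".join(self._current["heading"].split())
        text = " ".join(self._current["text"].split())
        if heading or text:
            self.chunks.append({"heading": heading, "text": text})
        self._current = {"heading": "", "text": ""}

    def close(self):
        super().close()
        self._flush()  # emit the final chunk


def chunk_html(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each chunk keeps its heading next to its body text, which tends to make retrieval results easier to interpret than fixed-size character windows.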

1 vs I or 0 vs O are valid issues, but in practice - and there's probably selection bias here - we've seen documents with a ton of diagrams and charts (that are much simpler to deal with as images).