by serjester 329 days ago
There's multiple fundamental problems people need to be aware of.

- LLM's are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images.

- Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text (ie higher inference costs and slower responses). Going lower results in blurry images.

- Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.
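
A back-of-the-envelope sketch of the token-cost point above. The tile size, tokens per tile, characters per token, and page length are all assumed placeholder values, not any provider's real accounting; image token counts differ per model.

```python
import math

def image_tokens(width, height, tile=512, tokens_per_tile=256):
    """Rough token estimate for a tiled vision encoder.

    tile and tokens_per_tile are assumptions for illustration only.
    """
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * tokens_per_tile

def text_tokens(num_chars, chars_per_token=4):
    """Rough text estimate, assuming about 4 characters per token."""
    return math.ceil(num_chars / chars_per_token)

page_as_image = image_tokens(1536, 2048)  # 3 x 4 tiles
page_as_text = text_tokens(3000)          # assumed 3000-character page
print(page_as_image, page_as_text, round(page_as_image / page_as_text, 1))
```

Under these assumptions the image page costs roughly 4X the text page, and halving the resolution cuts the tile count at the price of legibility.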

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.

6 comments

I think it would be good to combine traditional OCR with an LLM to fix up mistakes and add diagram representations. LLMs have the problem of just inventing plausible-sounding text when they can't read something, which is worse than just garbling the result. For instance, GPT-4.1 worked perfectly with a screenshot of your comment at 1296 × 179, but if I zoom out to 50% and give it a 650 × 84 screenshot instead, the result is:

"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."

It mostly gets it right but notice it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens".
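
One cheap way to catch this kind of silent change when you have both a traditional OCR transcript and an LLM transcript is to compare only the numbers in each. This is a minimal sketch, not a tested pipeline; the regex and the function names are my own assumptions, and a mismatch only tells you the two disagree, not which one is right.

```python
import re

# Digits, optional decimal part, optional k/x suffix (e.g. "3.5k", "5X").
NUMBER = re.compile(r"\d+(?:\.\d+)?[kKxX]?")

def numeric_tokens(text):
    """Return the numeric tokens of a transcript in reading order."""
    return NUMBER.findall(text)

def numeric_mismatch(ocr_text, llm_text):
    """True when the two transcripts disagree on any number."""
    return numeric_tokens(ocr_text) != numeric_tokens(llm_text)

original = "Pdf's at 1536 × 2048 use 3 to 5X more tokens"
rewritten = "A PNG at 512x 2048 is 3.5k more tokens"
print(numeric_tokens(original))
print(numeric_tokens(rewritten))
print(numeric_mismatch(original, rewritten))
```

Pages that trip the check can be sent for a higher-resolution pass or human review instead of being trusted as-is.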

True, but tricks in modern models, such as Gemma 3's pan & scan and training on multiple resolutions, do alleviate these issues.

An interesting property of the Gemma 3 family is that increasing the input image size does not increase processing memory requirements, because a second-stage encoder compresses it into fixed-size tokens. Very neat in practice.
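
To illustrate the general idea (this is not Gemma 3's actual code, just one generic way to get a fixed token count), here is a sketch that average-pools a feature grid of any size down to a fixed output grid. The function name and the scalar-per-cell simplification are assumptions.

```python
def pool_to_fixed_grid(features, out_rows, out_cols):
    """Average-pool a 2D grid of scalar features to out_rows x out_cols.

    The output size is fixed regardless of how large the input grid is.
    """
    in_rows, in_cols = len(features), len(features[0])
    pooled = []
    for r in range(out_rows):
        # Row band of the input that maps to output row r (at least one row).
        r0 = r * in_rows // out_rows
        r1 = max((r + 1) * in_rows // out_rows, r0 + 1)
        row = []
        for c in range(out_cols):
            c0 = c * in_cols // out_cols
            c1 = max((c + 1) * in_cols // out_cols, c0 + 1)
            cells = [features[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

small = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
large = [[float(i + j) for j in range(8)] for i in range(8)]
print(len(pool_to_fixed_grid(small, 2, 2)), len(pool_to_fixed_grid(large, 2, 2)))
```

Both inputs come out as a 2 x 2 grid, so whatever consumes the tokens sees the same sequence length either way.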

You can add OCR with Gemini, and presumably that would lead to better results than the OCR model we compared against. However, it's important to note that then you're guaranteeing that the entire corpus of documents you're processing will go through a large VLM. That can be prohibitively expensive and slow.

There are definitely trade-offs to be made here; we found this to be the most effective in most cases.

VLMs capable of parsing images with high fidelity are 10-50X cheaper than the frontier models. Any savings from not parsing are quickly going to be wiped out if someone has any actual traffic. Not to mention the massive hits to long-context accuracy and latency.

That's what their document parse product is for. I think people sometimes feed things to an LLM, and sure, it might work, but it could also be the wrong tool for the job. Not everything needs to run through the LLM.

LLMs are exactly the tool to use when other parsing methods fail due to poor formatting. AI is for the fuzzy cases.

This makes sense, but is there something to shaking up the RAG pipeline? Perhaps you could take each RAG result and then do a model processing step to ask it to extract relevant information from the image directly pertaining to the user query, once per result, and then aggregate those (text) results as the input to your final generation. That would sidestep the token limit for multiple images, and allow parallelizing the image understanding step.

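
A rough sketch of that per-result extraction step, assuming you supply your own `extract(query, image)` callable that makes a single-image model call and returns text. That callable and the function name here are hypothetical, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_then_aggregate(query, images, extract, max_workers=4):
    """Call extract(query, image) once per retrieved image, in parallel,
    and join the non-empty text results into one context string."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the retrieved results.
        notes = list(pool.map(lambda image: extract(query, image), images))
    parts = [
        f"[result {i + 1}] {note.strip()}"
        for i, note in enumerate(notes)
        if note and note.strip()  # drop results with nothing relevant
    ]
    return "\n\n".join(parts)

# Stand-in extractor for demonstration; a real one would call a VLM.
def fake_extract(query, image):
    return "" if image == "p2.png" else f"{image} mentions {query}"

print(extract_then_aggregate("revenue", ["p1.png", "p2.png", "p3.png"], fake_extract))
```

Each call only ever sees one image, so the final generation step gets plain text and the multi-image context problem never comes up. The cost is one extra model call per retrieved result.
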
Context window extrapolation should work with hierarchical/multi-scale tokenization of images, such as Haar wavelets
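
For anyone unfamiliar, a minimal sketch of what a multi-scale Haar decomposition looks like, using the averaging (unnormalized) variant on a grayscale grid with even dimensions. How you would turn the bands into tokens is left open; that part is speculation, and the function names are my own.

```python
def haar2d_level(img):
    """One level of an averaging 2D Haar transform.

    Returns (approx, (dx, dy, dxy)): a half-resolution average image plus
    three detail bands built from each 2x2 block.
    """
    approx, dx, dy, dxy = [], [], [], []
    for r in range(0, len(img), 2):
        a_row, x_row, y_row, xy_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            tl, tr = img[r][c], img[r][c + 1]
            bl, br = img[r + 1][c], img[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 4)   # block average
            x_row.append((tl - tr + bl - br) / 4)   # left minus right
            y_row.append((tl + tr - bl - br) / 4)   # top minus bottom
            xy_row.append((tl - tr - bl + br) / 4)  # diagonal difference
        approx.append(a_row)
        dx.append(x_row)
        dy.append(y_row)
        dxy.append(xy_row)
    return approx, (dx, dy, dxy)

def multiscale(img, levels):
    """Return the coarsest approximation and detail bands, coarse to fine."""
    bands = []
    current = img
    for _ in range(levels):
        current, details = haar2d_level(current)
        bands.append(details)
    return current, bands[::-1]

img = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
coarse, bands = multiscale(img, 2)
print(coarse, bands[0])
```

The coarse approximation is a short sequence on its own, and each finer band appends more, so sequence length can grow incrementally the way text does when going from 4000 to 4001 tokens.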