
by serjester 329 days ago

There's multiple fundamental problems people need to be aware of.

- LLM's are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images.

- Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text (ie higher inference costs and slower responses). Going lower results in blurry images.

- Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.

6 comments

joegibbs 329 days ago

I think it would be good to combine traditional OCR with an LLM to fix up mistakes and add diagram representations - LLMs have the problem of just inventing plausible-sounding text if they can't read it, which is worse than just garbling the result. For instance, GPT4.1 worked perfectly with a screenshot of your comment at 1296 × 179, but if I zoom out to 50% and give it a 650 × 84 screenshot instead, the result is:

"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."

It mostly gets it right but notice it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens".

pilooch 329 days ago

True, but modern models such as gemma3, with pan & scan and other tricks such as training on multiple resolutions, do alleviate these issues.

An interesting property of the gemma3 family is that increasing the input image size does not actually increase processing memory requirements, because a second-stage encoder compresses it into fixed size tokens. Very neat in practice.

ArnavAgrawal03 329 days ago

You can add OCR with Gemini, and presumably that would lead to better results than the OCR model we compared against. However, it's important to note that then you're guaranteeing that the entire corpus of documents you're processing will go through a large VLM. That can be prohibitively expensive and slow.

Definitely trade-offs to be made here, we found this to be the most effective in most cases.

serjester 329 days ago

VLMs capable of parsing images with high fidelity are 10 - 50X cheaper than the frontier models. Any savings from not parsing are quickly going to be wiped out if someone has any actual traffic. Not to mention the massive hits to long context accuracy and latency.

tom_m 329 days ago

That's what their document parse product is for. I think people feed things to an LLM sometimes and sure it might work but it could also be the wrong tool for the job. Not everything needs to run through the LLM.

hdjrudni 329 days ago

LLMs are exactly the tool to use when other parsing methods fail due to poor formatting. AI is for the fuzzy cases.

CGamesPlay 329 days ago

This makes sense, but is there something to be said for shaking up the RAG pipeline? Perhaps you could take each RAG result and then do a model processing step to ask it to extract relevant information from the image directly pertaining to the user query, once per result, and then aggregate those (text) results as the input to your final generation. That would sidestep the token limit for multiple images, and allow parallelizing the image understanding step.
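A rough sketch of that idea, to make the shape concrete. Everything here is an assumption for illustration: `extract` and `generate` stand in for whatever VLM and text-model calls you would actually use, and `fake_extract` / `fake_generate` are hypothetical stubs, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def answer_with_per_image_extraction(
    query: str,
    images: Sequence[str],
    extract: Callable[[str, str], str],
    generate: Callable[[str, List[str]], str],
    max_workers: int = 4,
) -> str:
    """Run one extraction call per retrieved image, then one text-only generation.

    `extract(image, query)` is assumed to return the text a vision model
    pulled out of a single image for this query (empty string if nothing
    relevant). `generate(query, notes)` is assumed to be a text-only call.
    """
    # Map step: each image is processed on its own, so no single request
    # ever carries more than one image. Calls run in parallel threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        notes = list(pool.map(lambda img: extract(img, query), images))

    # Drop results where the extractor found nothing relevant.
    relevant = [note for note in notes if note.strip()]

    # Reduce step: the final generation only ever sees text.
    return generate(query, relevant)


# Hypothetical stand-ins so the sketch runs without any model behind it.
def fake_extract(image: str, query: str) -> str:
    return f"{image}: text relevant to '{query}'"


def fake_generate(query: str, notes: List[str]) -> str:
    return f"Q: {query}\n" + "\n".join(notes)


if __name__ == "__main__":
    print(answer_with_per_image_extraction(
        "Q3 revenue", ["p1.png", "p2.png"], fake_extract, fake_generate))
```

The trade-off is one extra model call per retrieved result, in exchange for never putting more than one image in a single context.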

woctordho 329 days ago

Context window extrapolation should work with hierarchical/multi-scale tokenization of images, such as Haar wavelets
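For anyone unfamiliar with the decomposition being referred to, here is a minimal pure-Python sketch of a multi-level 2D Haar transform. It is not how any particular model tokenizes images; it only illustrates the coarse-to-fine structure, where each extra level of detail is appended after the coarser ones rather than changing them. The function names and the unnormalized averaging convention are choices made for this example.

```python
from typing import List, Tuple

Grid = List[List[float]]


def haar_level(img: Grid) -> Tuple[Grid, Tuple[Grid, Grid, Grid]]:
    """One level of an (unnormalized) 2D Haar transform.

    Expects even height and width. Returns a half-resolution approximation
    plus three half-resolution detail bands: left-right difference,
    top-bottom difference, and diagonal difference.
    """
    approx, d_lr, d_tb, d_diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_lr, row_tb, row_dg = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_a.append((a + b + d + e) / 4)   # block average
            row_lr.append((a - b + d - e) / 4)  # left column minus right
            row_tb.append((a + b - d - e) / 4)  # top row minus bottom
            row_dg.append((a - b - d + e) / 4)  # diagonal contrast
        approx.append(row_a)
        d_lr.append(row_lr)
        d_tb.append(row_tb)
        d_diag.append(row_dg)
    return approx, (d_lr, d_tb, d_diag)


def haar_pyramid(img: Grid, levels: int):
    """Return (coarsest approximation, detail bands ordered coarse to fine)."""
    details = []
    approx = img
    for _ in range(levels):
        approx, bands = haar_level(approx)
        details.append(bands)
    return approx, details[::-1]


if __name__ == "__main__":
    coarse, details = haar_pyramid([[1, 2], [3, 4]], levels=1)
    print(coarse, details)
```

Read as a sequence, the coarsest approximation comes first and each finer detail band extends it, which is the property that would make longer sequences an extension of shorter ones.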
