

by joegibbs 330 days ago

I think it would be good to combine traditional OCR with an LLM to fix up mistakes and add diagram representations. LLMs have the problem of just inventing plausible-sounding text when they can't read something, which is worse than just garbling the result. For instance, GPT-4.1 worked perfectly with a screenshot of your comment at 1296 × 179, but if I zoom out to 50% and give it a 650 × 84 screenshot instead, the result is:

"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."

It mostly gets it right, but notice that it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens".
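
To make the OCR-plus-LLM idea a bit more concrete, here is a rough sketch of one possible guard: treat the traditional OCR text as the reference and flag any number in the LLM's output that the OCR pass never saw. The function names (`numbers_in`, `unsupported_numbers`) are made up for illustration, and the check assumes the OCR output is at least roughly right about digits, which won't always hold.

```python
import re

# Matches integers and simple decimals such as "2048" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text):
    """Return every numeric token in the text, in order of appearance."""
    return NUMBER.findall(text)


def unsupported_numbers(ocr_text, llm_text):
    """Return numbers in llm_text that do not appear anywhere in ocr_text.

    Hypothetical helper: ocr_text is assumed to come from a traditional
    OCR pass, llm_text from the LLM transcription or clean-up step.
    """
    seen = set(numbers_in(ocr_text))
    return [n for n in numbers_in(llm_text) if n not in seen]


# The two phrasings quoted in the comment above.
ocr_text = "Pdf's at 1536 × 2048 use 3 to 5X more tokens"
llm_text = "A PNG at 512x 2048 is 3.5k more tokens"

flagged = unsupported_numbers(ocr_text, llm_text)
print(flagged)  # ['512', '3.5']
```

This only catches invented numbers, not the "Pdf's" to "A PNG" swap, so it would be one cheap check among several rather than a full fix.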