Hacker News new | ask | show | jobs
by themanmaran 335 days ago
Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking).

The biggest problem with direct image extraction is multipage documents. We found that single page extraction (OCR=>LLM vs Image=LLM) slightly favored the direct image extraction. But anything beyond 5 images had a sharp fall off in accuracy compared to OCR first.

Which makes sense, long context recall over text is already a hard problem, but that's what LLMs are optimized for. Long context recall over images is still pretty bad.

[1] https://getomni.ai/blog/ocr-benchmark

1 comments

That's an interesting point. We've found that for most use cases, over 5 pages of context is overkill. Having a small LLM conversion layer on top of images also ends up working pretty well (i.e. instead of direct OCR, passing batches of 5 images - if you really need that many - to smaller vision models and having them extract the most important points from the document).

We're currently researching surgery on the cache or attention maps for LLMs to have larger batches of images work better. Seems like Sliding window or Infinite Retrieval might be promising directions to go into.

Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.

This just depends a lot on how well you can parse down the context prior to passing to an LLM.

Ex: Reading contracts or legal documents. Usually a 50 page document that you can't very effectively cherry pick from. Since different clauses or sections will be referenced multiple times across the full document.

In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.