|
|
|
|
|
by themanmaran
335 days ago
|
|
Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking). The biggest problem with direct image extraction is multipage documents. We found that single page extraction (OCR=>LLM vs Image=LLM) slightly favored the direct image extraction. But anything beyond 5 images had a sharp fall off in accuracy compared to OCR first. Which makes sense, long context recall over text is already a hard problem, but that's what LLMs are optimized for. Long context recall over images is still pretty bad. [1] https://getomni.ai/blog/ocr-benchmark |
|
We're currently researching surgery on the cache or attention maps for LLMs to have larger batches of images work better. Seems like Sliding window or Infinite Retrieval might be promising directions to go into.
Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.