| There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text (ie higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images. Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results. An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort. |
"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.
Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.
An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."
It mostly gets it right, but notice that it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens". It also drops the "4k" from "pre-trained on 4k text tokens" and turns "ie" into "so".
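
Substitutions like these are easy to miss when reading the two versions side by side, so a word-level diff can help surface them. The following is a minimal sketch using Python's standard `difflib`; the `word_diff` helper and the shortened sample strings are illustrative assumptions, not part of the original test.

```python
import difflib


def word_diff(original: str, transcription: str) -> list[tuple[str, str, str]]:
    """Return (operation, original_words, transcribed_words) for each changed span."""
    a = original.split()
    b = transcription.split()
    # autojunk=False keeps difflib from treating frequent words as noise on long inputs.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Shortened excerpts of the original comment and the model's reading of it.
original = "Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text"
transcription = "A PNG at 512x 2048 is 3.5k more tokens than the raw text"

for tag, before, after in word_diff(original, transcription):
    print(f"{tag}: {before!r} -> {after!r}")
```

Splitting on whitespace keeps the comparison at word granularity, so a changed number shows up as a single replaced span instead of a scatter of character edits.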