We recently published an open source benchmark [1] specifically for evaluating VLMs vs. OCR. In general, the VLMs did much better than the traditional OCR models.

VLM highlights:

- Handwriting. Being contextually aware helps here, i.e. they read the document like a human would, interpreting the whole word/sentence instead of going character by character.
- Charts/infographics. VLMs can actually interpret charts or flow diagrams into a text format, including things like color-coded lines.

Traditional OCR highlights:

- Standardized documents (e.g. US tax forms that they've been trained on).
- Dense text. Imagine textbooks and multi-column research papers. This is the easiest OCR use case, but VLMs really struggle as the number of output tokens increases.
- Bounding boxes. There still isn't really a model that gives super precise bounding boxes. Supposedly Gemini and Qwen were trained for it, but they don't perform as well as traditional models.

There's still a ton of room for improvement, but especially with models like Gemini the accuracy/cost is really competitive.

[1] https://github.com/getomni-ai/benchmark
As you mentioned, there are a few caveats to VLMs that folks are typically unaware of (not an exhaustive list, just the ones you highlighted):
1. Long-form (dense) text: Output token limits of 4K/8K mean that the text on a dense page can exceed what the model is able to emit in one response. It takes some careful work to make VLMs handle these pages as seamlessly as OCR does.
2. Visual grounding (a.k.a. bounding boxes) is definitely one of those things that VLMs aren't natively good at, partly because the cross-entropy losses used in training aren't really geared for bounding box regression. We're making some strides here [1], so that you get an experience almost as good as native bounding box regression, all within the same VLM.
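For the dense-text caveat in point 1, one possible workaround is to split a page into bands that each fit under the output cap and transcribe them separately. This is only a rough sketch: it assumes you already have a per-line estimate of output tokens (e.g. from a layout pass), and `plan_bands` and the 4096 budget are hypothetical names and numbers, not part of any library.

```python
from typing import List


def plan_bands(line_tokens: List[int], budget: int = 4096) -> List[List[int]]:
    """Group consecutive line indices into bands whose estimated output
    tokens stay within `budget`. Hypothetical helper for illustration."""
    bands: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i, n in enumerate(line_tokens):
        # Close the current band if adding this line would exceed the budget.
        if current and used + n > budget:
            bands.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        bands.append(current)
    return bands


if __name__ == "__main__":
    # Five lines with made-up token estimates, split under a 4096 budget.
    print(plan_bands([1500, 1500, 1500, 1500, 500], budget=4096))
```

A single line whose estimate already exceeds the budget still ends up in a band of its own, so a real pipeline would need a fallback for that case.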
[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...
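To make the loss-function remark in point 2 concrete: when box coordinates are emitted as discrete tokens, cross-entropy only looks at the probability assigned to the exact target bin, so a prediction that is one bin off can be penalized the same as one that is hundreds of bins off, whereas a regression loss such as L1 grows with the distance. Below is a toy sketch of that difference. The 1000-bin quantization, the `peaked` helper, and all the numbers are assumptions for illustration, not how any particular model is trained.

```python
import math


def cross_entropy(probs, target_bin):
    """Token-level cross-entropy: -log p(target bin)."""
    return -math.log(probs[target_bin])


def peaked(n_bins, center, peak=0.9):
    """Toy distribution: `peak` mass on `center`, the rest spread uniformly."""
    rest = (1.0 - peak) / (n_bins - 1)
    return [peak if i == center else rest for i in range(n_bins)]


N_BINS, TARGET = 1000, 500

# Two wrong predictions for one coordinate: off by 1 bin vs. off by 400 bins.
near = peaked(N_BINS, 501)
far = peaked(N_BINS, 900)

ce_near = cross_entropy(near, TARGET)
ce_far = cross_entropy(far, TARGET)
l1_near = abs(501 - TARGET)
l1_far = abs(900 - TARGET)

print("cross-entropy:", ce_near, ce_far)  # identical for both misses
print("L1 distance:  ", l1_near, l1_far)  # grows with the size of the miss
```

Real softmax outputs tend to put some mass on neighboring bins, so the effect is less extreme in practice, but the loss itself carries no notion of coordinate distance.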