| I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker . Across 375 samples with LLM as a judge, mistral scores 4.32, and marker 4.41 . Marker can inference between 20 and 120 pages per second on an H100. You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... . The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon. Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs. |
For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.
Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.
I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?
[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...