|
|
|
|
|
by lolinder
471 days ago
|
|
> with LLM as a judge For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001. Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort. I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of? [0] https://github.com/VikParuchuri/marker/blob/master/benchmark... |
|
- Every document has ground truth text, a JSON schema, and the ground truth JSON.
- Run OCR on each document and pass the result to GPT-4o along with the JSON Schema
- Compare the predicted JSON against the ground truth JSON for accuracy.
In our benchmark, the ground truth text => gpt-4o was 99.7%+ accuracy. Meaning whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.
https://github.com/getomni-ai/benchmark