by alberto-m
488 days ago
It seems to me that the software is occasionally doing better than the supposed “ground truth” (who annotated that?), and I don't understand why the authors blindly follow the latter, or why the reviewers apparently approved that. In Figure 1 the authors complain that Gemini “misreads 'ss ety!' as 'ness ety!'”, but even a casual look at the image reveals that Gemini's reading is correct. In Figure 11, they state that Claude is “altering the natural sequence of ideas in the ground truth”, except that the sequence in the ground truth makes no sense, while Claude's order does (only the initial “the” is misplaced).
TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly than that, it's a fake scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but are not. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.
For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR alone, it might've been an 8, and there's really no way to tell. Why accept the "s" in the ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products Trader Joe's sells and, coupling that with the visible tops of the letters "M I L", would deduce that it's Coconut Milk. So really, Claude nailed that one.