by alberto-m 488 days ago
It seems to me that the software is occasionally doing better than the supposed “ground truth” (who annotated that?), and I don't understand why the authors are blindly following the latter, and the reviewers apparently approved that.

In Figure 1 the authors complain that Gemini “misreads 'ss ety!' as 'ness ety!'”, but even a casual look at the image reveals that Gemini's reading is correct.

In Figure 11, they state that Claude is “altering the natural sequence of ideas in the ground truth”, except that the sequence in the ground truth makes no sense, while Claude's order does (only the initial “the” is misplaced).

5 comments

I think the goal here was to convince the AI to actually read chars ("OCR") rather than speculate what might be written on paper/in the image. Hence why the ground truth is explicitly removing the letters & word parts that are obscured, even when they can be guessed.

TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But, IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly than that - it's a fake scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but are not. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.

For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR alone, it might've been an 8, there's really no way to tell. Why accept the "s" in the ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products are sold by Trader Joe's, and coupling that with the tops of the letters "M I L" that are visible, would deduce that's Coconut Milk. So really, Claude nailed that one.

I think there are multiple possible goals we could imagine in text recognition tasks. Should the AI guess the occluded text? That could be really helpful in some instances. But if the goal is OCR, then it should only recognize characters optically, and any guessing at occluded characters is undesired.
Maybe a better goal is some representation for "COCONUT [with these 3 letters occluded]". Then the consumer might combine this with other evidence about the occluded parts, or review it if questions come up about how accurate the OCR was in this case.
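
To make that idea concrete, here is a minimal sketch of what such a representation could look like. Everything in it is an assumption for illustration: the names `Glyph`, `Reading` and `mark_occluded` are hypothetical, and the choice of which three letters of "COCONUT" are hidden is arbitrary, since the thread doesn't say. It is not how the paper or any OCR library represents output.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Glyph:
    seen: Optional[str]          # character read optically, or None if occluded
    guess: Optional[str] = None  # context-based guess for an occluded character


@dataclass
class Reading:
    glyphs: List[Glyph]

    def optical(self, placeholder: str = "_") -> str:
        """Only what was actually visible; occluded positions become placeholders."""
        return "".join(
            g.seen if g.seen is not None else placeholder for g in self.glyphs
        )

    def inferred(self, placeholder: str = "_") -> str:
        """Visible characters, with guesses filled in where one is available."""
        return "".join(
            g.seen if g.seen is not None else (g.guess or placeholder)
            for g in self.glyphs
        )

    def occluded_positions(self) -> List[int]:
        """Indices a consumer may want to review or cross-check."""
        return [i for i, g in enumerate(self.glyphs) if g.seen is None]


def mark_occluded(word: str, hidden: Set[int]) -> Reading:
    """Hypothetical helper: build a Reading where `hidden` indices are occluded."""
    return Reading(
        [Glyph(None, ch) if i in hidden else Glyph(ch) for i, ch in enumerate(word)]
    )


# Illustrative only: pretend the last three letters are the occluded ones.
reading = mark_occluded("COCONUT", {4, 5, 6})
print(reading.optical())             # strict OCR view
print(reading.inferred())            # context-assisted view
print(reading.occluded_positions())  # where the guessing happened
```

A benchmark could then score the strict view and the context-assisted view separately, instead of forcing one "ground truth" to serve both goals.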
In the very first example (occluded text) the "ground truth" is just incorrect.
Re: reviewers, I don't see any mention of this being accepted into a peer-reviewed venue. Peer review isn't required for arXiv submissions.

You are right, I missed that.

>reviewers apparently approved that.

What reviewers?