| HN Mirror

by alberto-m 488 days ago

It seems to me that the software is occasionally doing better than the supposed “ground truth” (who annotated that?), and I don't understand why the authors are blindly following the latter, and the reviewers apparently approved that.

In Figure 1 the authors complain that Gemini “misreads 'ss ety!' as 'ness ety!'”, but even a casual look at the image reveals that Gemini's reading is correct.

In Figure 11, they state that Claude is “altering the natural sequence of ideas in the ground truth”, except that the sequence in the ground truth makes no sense, while Claude's order does (only the initial “the” is misplaced).

5 comments

virgilp 488 days ago

I think the goal here was to convince the AI to actually read chars ("OCR") rather than speculate what might be written on paper/in the image. Hence why the ground truth is explicitly removing the letters & word parts that are obscured, even when they can be guessed.

TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly, it's a fake scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but are not. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.

For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR, it might've been an 8, there's really no way to tell. Why accept the "s" in ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products are sold by Trader Joe's, and coupling that with the tops of the letters "M I L" that are visible, would deduce that's Coconut Milk. So really, Claude nailed that one.

8organicbits 488 days ago

I think there are multiple possible goals we could imagine in text recognition tasks. Should the AI guess the occluded text? That could be really helpful in some instances. But if the goal is OCR, then it should only recognize characters optically, and any guessing at occluded characters is undesired.

abecedarius 488 days ago

Maybe a better goal is some representation for "COCONUT [with these 3 letters occluded]". Then the consumer might combine this with other evidence about the occluded parts, or review it if questions come up about how accurate the OCR was in this case.
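
A minimal sketch of what such a representation could look like, in Python. Everything here is hypothetical: the names `Char`, `Reading`, and `make_reading` are made up for illustration, and the choice of which three letters of "COCONUT" are occluded is an assumption, not something taken from the paper or the image.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Char:
    guess: str      # best-guess character at this position
    occluded: bool  # True if the character is not actually visible


@dataclass
class Reading:
    chars: List[Char]

    def full_text(self) -> str:
        # Inferred reading: visible characters plus guesses for occluded ones.
        return "".join(c.guess for c in self.chars)

    def visible_text(self, placeholder: str = "?") -> str:
        # Strict optical reading: occluded positions are masked out.
        return "".join(placeholder if c.occluded else c.guess for c in self.chars)

    def occluded_count(self) -> int:
        return sum(1 for c in self.chars if c.occluded)


def make_reading(text: str, occluded_indices: Iterable[int]) -> Reading:
    # Hypothetical helper: mark the given positions as occluded.
    occluded = set(occluded_indices)
    return Reading([Char(ch, i in occluded) for i, ch in enumerate(text)])


# Assumed example: positions 2, 3, 4 are the occluded letters.
reading = make_reading("COCONUT", occluded_indices=[2, 3, 4])
print(reading.full_text())       # COCONUT
print(reading.visible_text())    # CO???UT
print(reading.occluded_count())  # 3
```

With something like this, a strict-OCR benchmark could score against `visible_text()` while a downstream consumer uses `full_text()`, and a reviewer can still see which characters were guessed.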

bufferoverflow 488 days ago

In the very first example (occluded text) the "ground truth" is just incorrect.

dimatura 488 days ago

Re: reviewers, I don't see any mention of this being accepted into a peer-reviewed venue. Peer review isn't necessary for arXiv submissions.

alberto-m 487 days ago

You are right, I missed that.

Vt71fcAqt7 488 days ago

>reviewers apparently approved that.

What reviewers?
