| HN Mirror


by virgilp 490 days ago

I think the goal here was to convince the AI to actually read chars ("OCR") rather than speculate what might be written on paper/in the image. Hence why the ground truth is explicitly removing the letters & word parts that are obscured, even when they can be guessed.

TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly, it's a fake scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but are not. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.
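To make that suggestion concrete, here is a rough sketch of how such unguessable ground truth could be generated and scored. Everything in it is an assumption for illustration: the helper names (`random_plate`, `near_word_id`, `char_accuracy`), the plate format, and the look-alike table are hypothetical and not taken from the benchmark being discussed.

```python
import random
import string

def random_plate(rng):
    """Hypothetical plate-like string: 3 letters, a dash, 4 digits."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return f"{letters}-{digits}"

def near_word_id(word, rng):
    """Swap one letter of a real word for a look-alike digit (e.g. S -> 5),
    so a reader that guesses from context gets it wrong."""
    lookalikes = {"S": "5", "O": "0", "I": "1", "B": "8", "E": "3"}  # assumed table
    positions = [i for i, ch in enumerate(word) if ch in lookalikes]
    if not positions:
        return word  # nothing to swap; return the word unchanged
    i = rng.choice(positions)
    return word[:i] + lookalikes[word[i]] + word[i + 1:]

def char_accuracy(truth, predicted):
    """Fraction of matching positions, penalizing any length mismatch."""
    matches = sum(a == b for a, b in zip(truth, predicted))
    return matches / max(len(truth), len(predicted), 1)

rng = random.Random(0)  # fixed seed so the test set is reproducible
plate = random_plate(rng)
tricky = near_word_id("BASELINE", rng)
```

A reader that "autocorrects" the tricky ID back to BASELINE would score 7/8 on `char_accuracy`, which is exactly the kind of speculation the test is trying to catch.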

For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR, it might've been an 8, there's really no way to tell. Why accept the "s" in ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products are sold by Trader Joe's, and coupling that with the top of the letters "M I L" that are visible, would deduce that's Coconut Milk. So really, Claude nailed that one.