| HN Mirror



	by pmarreck 2 hours ago
	My attempts at using AI for OCR have always produced invented artifacts, which makes it unusable in production. Does this suffer from that as well? A simple example: words that are supposed to be in other languages get automatically translated to English, which ruins the effect.

3 comments

pbhjpbhj 1 hour ago

You almost don't want [super-]word-level ML (i.e. word-pair/phrase/sentence/document/corpus level).

In transcription, you want near certainty, or you want a marker saying the word could not be read with certainty. Yes, context lets you guess, but for some OCR you want to know when the result is a guess based on something other than the letters, in order, forming a word.

For example, in a census document on familysearch.com the transcriber "corrected" a name to Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough that's a local variant spelling (Eire).

In another document the writer has used "Joh" as an abbreviation, a [human, I assume] transcriber put that as John ... which is most likely, but happens to be wrong.

Sometimes you care that it's guessed, sometimes you want just the best guess.
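To make the "guess vs. literal reading" distinction concrete, here is a minimal sketch. It assumes the OCR step can give you, per word, the literal letters, a context-based guess, and some confidence score; the `Word` type, its fields, the `[?...]` marker, and the 0.9 threshold are all made up for illustration and not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass
class Word:
    literal: str       # letters exactly as read from the page
    guess: str         # best guess once context is taken into account
    confidence: float  # hypothetical 0..1 score for the literal reading


def render(words, threshold=0.9, best_guess=False):
    """Join words into a line, either as best guesses or with doubts marked."""
    out = []
    for w in words:
        if best_guess:
            # Caller only wants the most likely text, no markers.
            out.append(w.guess)
        elif w.literal != w.guess:
            # Context disagrees with the letters: keep the letters, show the guess.
            out.append(f"{w.literal}[?{w.guess}]")
        elif w.confidence < threshold:
            # Letters and guess agree, but the reading itself is shaky.
            out.append(f"{w.literal}[?]")
        else:
            out.append(w.literal)
    return " ".join(out)


words = [
    Word("Josepth", "Joseph", 0.95),
    Word("Joh", "John", 0.60),
    Word("Smith", "Smith", 0.99),
]
print(render(words))                   # literal letters with guesses marked
print(render(words, best_guess=True))  # plain best guess
```

The point of keeping both fields is that the same data serves both cases: the marked output when you care that something was guessed, the plain one when you just want the best guess.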


messe 43 minutes ago

> Eire

A nitpick, because it's often a dogwhistle: almost nobody in Ireland calls it that when speaking English. And it's still incorrect in Irish; the correct spelling is Éire.


drakmo 1 hour ago

If I wanted to achieve 100% recognition, I would combine this method with an image model that recreates the original document from the transcribed text, matching the layout. You can do that by giving the model everything except the page or paragraph you want to recreate (to avoid it recreating the exact passage under test directly from the image artifact). After reconstruction, you can do an optical comparison that specifically looks for misaligned characters to find the errors. Rinse and repeat. Expensive, but it would guarantee 100% recognition.
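A rough sketch of just the comparison step, under heavy assumptions: the original scan and the re-rendered page are already binarized and aligned to the same size (plain lists of 0/1 rows here), and character bounding boxes are known. The function names, the box format `(top, left, bottom, right)`, and the 0.2 threshold are hypothetical; the rendering model and the repeat loop are not shown.

```python
def mismatch_ratio(original, rendered, box):
    """Fraction of pixels inside `box` that differ between the two images."""
    top, left, bottom, right = box
    total = 0
    differing = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += 1
            if original[y][x] != rendered[y][x]:
                differing += 1
    return differing / total if total else 0.0


def suspect_boxes(original, rendered, boxes, threshold=0.2):
    """Indices of character boxes whose pixels disagree more than `threshold`."""
    return [
        i for i, box in enumerate(boxes)
        if mismatch_ratio(original, rendered, box) > threshold
    ]


# Two tiny 2x4 "pages": the left character matches, the right one does not.
original = [[0, 1, 0, 1],
            [0, 1, 0, 1]]
rendered = [[0, 1, 1, 0],
            [0, 1, 1, 0]]
boxes = [(0, 0, 2, 2), (0, 2, 2, 4)]

print(suspect_boxes(original, rendered, boxes))
```

The flagged boxes are where you would re-transcribe and re-render on the next pass. A raw pixel comparison like this is very sensitive to alignment and font differences, so it only illustrates the idea.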


aliljet 48 minutes ago

I'm curious about this. What models/tools have you been using?
