Hacker News
by pbhjpbhj 1 hour ago
You almost don't want [super-]word-level ML (i.e. word-pair/phrase/sentence/document/corpus level).

In transcription, you want near certainty, or you want a marker that the word could not be read with certainty. Yes, context lets you guess, but for some OCR you want to know when the output is a guess based on something other than the letters themselves, in order, forming a word.

For example, in a census document on familysearch.com the transcriber "corrected" a name to Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough that's a local variant spelling (Eire).

In another document the writer used "Joh" as an abbreviation; a [human, I assume] transcriber put that as John ... which is the most likely reading, but happens to be wrong.

Sometimes you care that it's guessed, sometimes you want just the best guess.
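
A minimal sketch of what that could look like in a transcription record, assuming each word keeps its literal letters separate from any context-based guess. The `Reading` type, its fields, and the `render` function are hypothetical names for illustration, not any real OCR tool's output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """One transcribed word, keeping the literal letters apart from any guess."""
    literal: str                      # letters as written, in order
    suggestion: Optional[str] = None  # context-based normalization, if any
    legible: bool = True              # False if the letters themselves were unclear


def render(readings, best_guess=False):
    """Join readings into text.

    best_guess=True  -> use the suggestion wherever one exists.
    best_guess=False -> keep the literal letters and flag every guess.
    """
    words = []
    for r in readings:
        if best_guess:
            words.append(r.suggestion or r.literal)
        elif r.suggestion is not None and r.suggestion != r.literal:
            words.append(f"{r.literal} [guess: {r.suggestion}]")
        elif not r.legible:
            words.append(f"{r.literal} [unclear]")
        else:
            words.append(r.literal)
    return " ".join(words)


# The two names from the examples above (hypothetical records).
names = [
    Reading("Josepth", suggestion="Joseph"),
    Reading("Joh", suggestion="John"),
]
print(render(names))                   # literal letters, guesses flagged
print(render(names, best_guess=True))  # best guess only
```

The point of keeping both fields is that the reader picks the mode; the literal reading is never overwritten by the guess.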

1 comment

> Eire

A nitpick, because it's often a dogwhistle: almost nobody in Ireland calls it that when speaking English. And it's still incorrect in Irish; the correct spelling is Éire.