HN Mirror



	by visarga 1653 days ago
	It's an old idea, using a language model on top of character level OCR. Works well for general text but doesn't solve random sequences of digits and letters. So you can't use it to correct your invoices where you have lots of out-of-dictionary tokens.
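	A minimal sketch of that limitation, assuming the simplest possible "language model": a word list plus fuzzy matching from Python's standard `difflib`. The vocabulary, the `correct` helper and the 0.75 cutoff are hypothetical choices for illustration, not anything a real OCR pipeline is known to use.

```python
import difflib

# Hypothetical vocabulary standing in for the "language model".
VOCAB = ["hello", "invoice", "total", "amount"]

def correct(token, vocab=VOCAB, cutoff=0.75):
    """Snap an OCR token to the closest dictionary word, if one is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    # With no sufficiently close dictionary word there is nothing to correct against.
    return matches[0] if matches else token

print(correct("he1lo"))      # an ordinary word with one misread character
print(correct("INV-8O231"))  # an invoice-style token with an O/0 confusion
```

	The ordinary word has a close dictionary neighbour and gets repaired. The invoice-style token has none, so the dictionary offers no evidence about whether the "O" should be a "0", and the token comes back unchanged.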

3 comments

bloak 1653 days ago

I've always found it somehow ironic that a human can correctly recognise printed characters even if parts of them are missing and the word is misspelt or in a language the human does not know at all, but computers have to resort to language models because an exact comparison of part of the image with other parts of the image (where the same letter is printed in the same font) for some reason is not feasible?


Super_Jambo 1653 days ago

Humans must be using a language model for image recognition when reading though. Otherwise things like failing to spot the the is a duplicate word wouldn't happen so often.


nephanth 1653 days ago

Another interesting quirk of human reading is that it is pretty agnostic to the actual order of letters in a word as long as the first and last one are placed correctly


Tcepsa 1653 days ago

I see what you did there (and I appreciate it! ^_^)


The-Bus 1653 days ago

It took your comment for me to realize what happened


Retric 1653 days ago

I think people use font models rather than just language models. The post office is perhaps the best example: despite a known list of addresses and an intended format, they still back up OCR with people.


AussieWog93 1653 days ago

> The post office is perhaps the best example where despite a known list of addresses and intended format they still back up OCR with people.

To be fair, the number of people who write their own address incorrectly is staggering.

I'm in eCommerce and easily 70%+ of addresses have some sort of minor error in them. Around 5-10% are just plain bizarre, with things like two suburbs being included or the street name not including the Rd/Av/Dr etc.

I'd suspect this is one of those problems that seems easy in the lab but quickly degenerates when you consider the human aspect of it.


IggleSniggle 1653 days ago

People do things like include two suburbs because they know that (eg) in order for it to reach their address correctly, it must first reach the human at point A that will correctly pass it to the human at point B, where it otherwise cannot arrive at point B because point B does not receive normal postal service, possibly because of a jurisdiction dispute in 1974 that placed their address in a zip code that’s different from all of their neighbors.

Can you tell that I know a bunch of people with these kinds of issues? You may already know that programmers mess up names all the time, telling people that their last name must be their surname, or that you can’t be an O’Reilly or a Robertson-Peele, or that “Mary Anne” is not a single name, or that your middle name cannot be your “primary” name, or that you must have more than one name, or that your legal name is invalid because it’s not the name you were born with, or that all 3yos have names, etc.

Well, take all of those issues, and add the vagaries of geography, and you’ve got mail delivery.


mauvehaus 1653 days ago

I have a weird address, and easily 30% of websites insist on fucking it up by applying validation rules that might make sense from 20,000 feet, but don't actually work in practice for our address.

The most straightforward of them is that some validation services insist that our ZIP code is for the next town over instead of the one we live in, which has its own post office. Nothing correct happens if our mail goes to the wrong post office because they (rightly) have no idea how to deliver mail to us.

I wouldn't be so confident that 100% of that 70% don't know their own address. For at least some of those cases, I'm willing to bet they know something you don't about the vagaries of mail and package delivery to their address.


Retric 1653 days ago

Modern postal OCR is generally good enough to detect bad addresses, but my point was people still beat it when the domain is so constrained.


throwawayboise 1653 days ago

Brains do pattern recognition much better than computers (albeit slower).


Andrew_nenakhov 1653 days ago

For now.


einpoklum 1653 days ago

> using a language model on top of character level OCR

But if you know you're going to use a language model after the OCR, then you don't OCR to a single character, but rather to a distribution of similarity over candidate characters (e.g. dropping the 90% least similar candidates, or clipping at a certain similarity threshold). Then the language model should have more to work with (although TBH its work becomes more complicated).
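A rough sketch of that idea, under loud assumptions: the per-position candidate scores below are made up, the "language model" is just a hypothetical unigram word prior, and `FLOOR`, `greedy_decode` and `rescore` are names invented for illustration. It only contrasts picking the single best character per position with scoring whole words against the full candidate distributions.

```python
import math

# Probability assigned to a character the OCR stage did not propose at all (assumed value).
FLOOR = 1e-6

def greedy_decode(candidates):
    """Classic character-level OCR: keep only the top candidate at each position."""
    return "".join(max(dist, key=dist.get) for dist in candidates)

def rescore(candidates, word_prior):
    """Pick the word maximising log P(word) + sum of log P(char | image) per position."""
    best_word, best_score = None, -math.inf
    for word, prior in word_prior.items():
        if len(word) != len(candidates):
            continue  # this sketch only handles same-length hypotheses
        score = math.log(prior)
        for ch, dist in zip(word, candidates):
            score += math.log(dist.get(ch, FLOOR))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up OCR output: one clipped candidate distribution per character position.
candidates = [
    {"c": 0.9, "e": 0.1},
    {"1": 0.5, "l": 0.4, "i": 0.1},
    {"e": 0.8, "c": 0.2},
    {"a": 0.7, "o": 0.3},
    {"r": 0.6, "n": 0.4},
]
# Hypothetical unigram prior standing in for a real language model.
word_prior = {"clear": 0.6, "clean": 0.3, "cedar": 0.1}

print(greedy_decode(candidates))        # top character per position
print(rescore(candidates, word_prior))  # best word given distributions and prior
```

Here the greedy path commits to "1" in the second position because it narrowly beats "l", while the rescoring step can still recover the "l" because that candidate was kept around. The extra work shows up in the loop over hypotheses, which a real system would replace with a beam search or lattice decoder.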


georgecmu 1653 days ago

If a dictionary satisfies your definition of a language model, yes, with predictably poor results[1]. If I understand correctly, the Google Books approach[2] represented a major improvement in the accuracy of automated OCR (and this is for printed text!), but I would venture to say that implementing a language model like this would be far beyond the scope of a 'tiny project'.

[1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati...

[2] https://tesseract-ocr.github.io/docs/Improving_Book_OCR_by_A...
