| HN Mirror

by efields 2757 days ago

Is off-the-shelf open source OCR not reliable for an image of reasonable fidelity, such as a smartphone photo of a B&W text document?

I ask because it feels like I should have an app that lets me scan with my phone, process the text with OCR, then let me plain text search every scanned document I have.

The first part only natively made it into iOS Notes a year or two ago, but that whole experience above should be out of the box, IMHO…

3 comments

Holybeds 2757 days ago

There's a difference between doing OCR and actually understanding what is what in the document content.

For normal text, OCR works well. But automatically understanding what is what is more complex.

njstraub608 2757 days ago

This ^^

And actually understanding the context of what you're trying to use OCR on can work backward to determine what the text actually is, i.e. if it's a "Name" field then the probabilities of ambiguous letters may change (in the case of handwriting recognition).
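
To illustrate the point about field context shifting the probabilities of ambiguous letters, here is a minimal sketch. The recognizer scores and the per-field priors are made-up numbers for illustration, and `rerank_with_prior` is a hypothetical helper, not part of any OCR library.

```python
def rerank_with_prior(char_scores, prior):
    """Multiply recognizer scores by a field-specific prior and renormalize."""
    # Characters missing from the prior get a tiny weight instead of zero.
    weighted = {ch: p * prior.get(ch, 1e-6) for ch, p in char_scores.items()}
    total = sum(weighted.values())
    return {ch: w / total for ch, w in weighted.items()}


def best(scores):
    """Return the highest-scoring character."""
    return max(scores, key=scores.get)


# Hypothetical recognizer output for one ambiguous handwritten glyph.
glyph_scores = {"l": 0.40, "1": 0.45, "I": 0.15}

# Hypothetical priors: digits are unlikely in a name, letters unlikely in an amount.
NAME_FIELD_PRIOR = {"l": 0.45, "I": 0.45, "1": 0.10}
AMOUNT_FIELD_PRIOR = {"1": 0.90, "l": 0.05, "I": 0.05}

as_name = rerank_with_prior(glyph_scores, NAME_FIELD_PRIOR)
as_amount = rerank_with_prior(glyph_scores, AMOUNT_FIELD_PRIOR)

print(best(glyph_scores), best(as_name), best(as_amount))
```

With these numbers the same glyph is read as a digit on its own and in an amount field, but as a letter in a name field.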

viig99 2757 days ago

No, open source OCR doesn't work that great. I work for a telecom company, and we process millions of documents a month. We built everything in-house and are now able to process them at almost 40 cents per 1,000 documents. It's a long process to handle large documents like payslips, which require text boundary detection, word identification, spatial clustering, and writing parsers (which depend on word, segment, and clustering probabilities) that can extract the required fields from the documents.
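
As a rough illustration of the spatial clustering and parser steps only, the sketch below groups already-recognized word boxes into lines and pulls out the value that follows a label. It is not the in-house system described above: the `Word` type, the tolerance value, and the sample payslip coordinates are all assumptions, and it ignores the word, segment, and clustering probabilities a real parser would use.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the word box
    y: float       # vertical center of the word box
    height: float  # box height, used to scale the line tolerance


def cluster_into_lines(words, tolerance=0.5):
    """Group words whose vertical centers are close into the same line."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the most recently added word of the current line.
        if lines and abs(word.y - lines[-1][-1].y) <= tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def extract_field(lines, label):
    """Return the text to the right of `label` on its line, or None."""
    for line in lines:
        texts = [w.text for w in line]
        if label in texts:
            return " ".join(texts[texts.index(label) + 1:]) or None
    return None


# Hypothetical OCR output for two lines of a payslip, in scrambled order.
words = [
    Word("Doe", 150, 101, 12),
    Word("Net:", 10, 130, 12),
    Word("Employee:", 10, 100, 12),
    Word("2,450.00", 100, 131, 12),
    Word("Jane", 100, 99, 12),
]

lines = cluster_into_lines(words)
print(extract_field(lines, "Employee:"), extract_field(lines, "Net:"))
```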

wahnfrieden 2757 days ago

This is an Evernote feature. Dropbox also launched this feature.

brad0 2757 days ago

Evernote is an interesting case.

They store every word that MAY be in the scanned document.

So their OCR engine will find a lot of legitimate words, but it will also find a lot of words that don't make sense.

When putting in a term for searching, it looks at the entire index (both legit words and the garbage) and returns you the documents that match.

I think it's quite clever.

Bear in mind that this was many years ago; I have no idea if this is still the case.

ocrcustomserver 2757 days ago

Yeah, Evernote's OCR engine will generate possible candidates for every given word and will sort them internally by confidence score.

Screenshot: https://s24953.pcdn.co/blog/wp-content/uploads/2018/02/longh...

Since it's aimed not at transcription (user doesn't know what he's looking for) but at retrieval (user knows what he's looking for), it can get away with mistakes.
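
A minimal sketch of that retrieval idea, assuming the OCR step hands back several (candidate, confidence) pairs per word. The `CandidateIndex` class and the sample data are hypothetical and are not Evernote's actual implementation; the point is only that garbage candidates are harmless because nobody searches for them.

```python
from collections import defaultdict


class CandidateIndex:
    """Inverted index over every OCR candidate, not just the top guess."""

    def __init__(self):
        # term -> {doc_id: best confidence seen for that term in that doc}
        self._postings = defaultdict(dict)

    def add(self, doc_id, word_candidates):
        """word_candidates: one list of (text, confidence) pairs per word."""
        for candidates in word_candidates:
            for text, confidence in candidates:
                term = text.lower()
                previous = self._postings[term].get(doc_id, 0.0)
                self._postings[term][doc_id] = max(previous, confidence)

    def search(self, query):
        """Return (doc_id, confidence) pairs, most confident first."""
        hits = self._postings.get(query.lower(), {})
        return sorted(hits.items(), key=lambda item: item[1], reverse=True)


index = CandidateIndex()
# Hypothetical OCR output: legitimate words and garbage stored side by side.
index.add("receipt-1", [
    [("invoice", 0.55), ("lnvoice", 0.30), ("invo1ce", 0.15)],
    [("total", 0.90)],
])
index.add("note-2", [
    [("invoke", 0.60), ("invoice", 0.35)],
])

print(index.search("invoice"))
```

A search for "invoice" finds both documents even though it was not the top candidate in the second one, which is fine for retrieval but would be a poor basis for transcription.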

References:

https://evernote.com/blog/how-evernotes-image-recognition-wo...

https://help.evernote.com/hc/en-us/articles/208314518-How-Ev...

https://evernote.com/blog/evernote-indexing-system/

julianz 2757 days ago

Yep, it's quite clever for searching for things, but much less useful for doing something based on the recognized text.

ocrcustomserver 2757 days ago

OneNote can do transcription (copy text from image).
