| HN Mirror


by cardigan 3146 days ago

A better baseline than an out-of-the-box OCR engine is to apply some computer vision techniques before running that same engine. This can significantly outperform a pure OCR engine approach. (Speaking from experience working at startups doing this kind of stuff for ~10 months.)

One dumb but surprisingly effective thing to try is to apply a bunch of random binarizing filters before putting your text into an OCR engine, then pick the most common output across those filtered images. The filters can be some combination of morphological operations (dilations/erosions in different orders and at different strengths), different levels of blur and sharpening, adaptive binary thresholds at different levels, resizing the image (hilarious, but some OCR engines are sensitive to relative scale), adding white borders of different thicknesses, rotating by small angles, denoising, ...
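A rough sketch of the idea, not the setup from the comment: the filters here are tiny pure-Python stand-ins (a real pipeline would more likely use an image library), and `ocr` is a hypothetical callable wrapping whatever engine you use. The names `random_variant` and `ocr_by_vote`, the parameter ranges, and the choice to drop empty outputs are all assumptions for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def binarize(img: Image, level: int) -> Image:
    """Global threshold: pixels at or above `level` become white."""
    return [[255 if p >= level else 0 for p in row] for row in img]


def add_border(img: Image, width: int) -> Image:
    """Pad the image with a white border `width` pixels thick."""
    full_width = len(img[0]) + 2 * width
    blank = [[255] * full_width for _ in range(width)]
    body = [[255] * width + row + [255] * width for row in img]
    return blank + body + [row[:] for row in blank]


def morph(img: Image, pick: Callable[[List[int]], int]) -> Image:
    """3x3 neighbourhood filter: pick=min thickens dark strokes, pick=max thins them."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(pick(window))
        out.append(row)
    return out


def random_variant(img: Image, rng: random.Random) -> Image:
    """Apply a random chain of filters with random parameters (ranges are arbitrary)."""
    out = img
    for _ in range(rng.randint(0, 2)):
        out = morph(out, rng.choice([min, max]))
    out = binarize(out, rng.randint(64, 192))
    return add_border(out, rng.randint(0, 8))


def ocr_by_vote(img: Image, ocr: Callable[[Image], str],
                n_variants: int = 25, seed: int = 0) -> str:
    """Run `ocr` on many randomly filtered copies and return the most common output."""
    rng = random.Random(seed)
    votes = Counter(ocr(random_variant(img, rng)) for _ in range(n_variants))
    votes.pop("", None)  # assumption: ignore variants where the engine read nothing
    return votes.most_common(1)[0][0] if votes else ""
```

Passing the engine in as a callable keeps the voting logic independent of any particular OCR library; the per-variant calls are also independent of each other, so they can be fanned out in parallel.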

This gave us ~95% word-level accuracy on our dataset (with Tesseract as the OCR engine), without tuning the filters. We had used the filters individually before, with some brittle logic for deciding when to use them; the random filter approach was kind of a joke that ended up working. Tesseract on its own had an abysmal ~40% word-level accuracy on that same dataset. The dataset was words segmented out of scanned bank statements.
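The comment doesn't say exactly how word-level accuracy was measured. One plausible reading, assuming one OCR output per segmented word image and an exact string match against the ground truth, would be something like this (the function name and the whitespace stripping are my assumptions):

```python
from typing import Sequence


def word_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of word images whose OCR output exactly matches the label."""
    if len(predicted) != len(expected):
        raise ValueError("need one prediction per labelled word")
    if not expected:
        return 0.0
    # Strip surrounding whitespace only; case and punctuation must match.
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return hits / len(expected)
```

Under this definition a single wrong character makes the whole word count as a miss, which is stricter than character-level accuracy.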

Main downside was that this was pretty slow (we were generating ~10k filters per word), but we found a way to make it work (hierarchical stacked AWS Lambdas, lol).

7 comments

Jhsto 3145 days ago

Selfless plug, but for some examples of these see a side project of mine: https://www.juusohaavisto.com/northern-nike-nabob.html


justonepost 3145 days ago

Or post-processing with domain-specific knowledge, e.g., https://www.taggun.io/ using Google Vision for receipts and then applying some ML, NLP, etc.


rb2018 3145 days ago

...and the result was better than what you could buy from Google Cloud Vision or ocr.space out of the box?


SophosQ 3145 days ago

Interestingly, when you think about it, the filters in a convolutional neural net aren't too different from your setup, and they provide similar, if not better, performance.


ocrcustomserver 3145 days ago

I would be interested in exchanging notes; send me an email if you're up for it.


aisofteng 3143 days ago

This is called an ensemble method.


wwarner 3146 days ago

c'mon -- that's the easy way!
