Hacker News
by cardigan 3146 days ago
A better baseline than an out-of-the-box OCR engine is to apply some computer vision techniques before running that engine. This can significantly outperform a pure OCR engine approach. (Speaking from experience working on startups doing this kind of stuff for ~10 months.)

One dumb but surprisingly effective thing to try is to apply a bunch of random binarizing filters before putting your text into an OCR engine, then pick the most common output across those filtered images. The filters can be some combination of morphological operations (dilates/erodes in different orders and at different strengths), different levels of blur and sharpening, adaptive binary thresholds at different levels, resizing the image (hilarious, but some OCR engines are sensitive to relative scale), adding white borders of different thicknesses, rotating by small angles, denoising, ...
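For illustration, a minimal dependency-free sketch of the voting idea (not the original implementation). Everything named here is an assumption made for the example: images are plain lists of rows of 0-255 grayscale values, only three toy filters are included, and `ocr` is any callable you supply that maps an image to a string.

```python
import random
from collections import Counter

# Assumption: an image is a list of rows of 0-255 grayscale ints (0 = black ink).

def threshold(img, t):
    """Global binarization: pixels >= t become white (255), the rest black (0)."""
    return [[255 if p >= t else 0 for p in row] for row in img]

def add_border(img, n):
    """Pad the image with an n-pixel white border on every side."""
    width = len(img[0]) + 2 * n
    top = [[255] * width for _ in range(n)]
    middle = [[255] * n + list(row) + [255] * n for row in img]
    bottom = [[255] * width for _ in range(n)]
    return top + middle + bottom

def thicken_strokes(img):
    """3x3 minimum filter: dark strokes grow by one pixel in every direction."""
    h, w = len(img), len(img[0])
    return [[min(img[yy][xx]
                 for yy in range(max(0, y - 1), min(h, y + 2))
                 for xx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)]
            for y in range(h)]

def random_pipeline(rng):
    """Build one random chain of filters with random strengths."""
    steps = []
    if rng.random() < 0.5:
        steps.append(thicken_strokes)
    steps.append(lambda img, t=rng.randrange(64, 192): threshold(img, t))
    steps.append(lambda img, n=rng.randrange(0, 4): add_border(img, n))
    return steps

def ocr_by_vote(img, ocr, n_variants=25, seed=0):
    """Run `ocr` on randomly filtered copies of `img`; return the most common text."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_variants):
        variant = img
        for step in random_pipeline(rng):
            variant = step(variant)
        text = ocr(variant).strip()
        if text:  # ignore variants where the engine read nothing
            votes[text] += 1
    return votes.most_common(1)[0][0] if votes else ""
```

In practice the list-based filters would be swapped for real image operations (blur, adaptive threshold, small rotations, rescaling, and so on) and `ocr` would wrap an actual engine; the voting loop stays the same.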

This gave us ~95% word-level accuracy on our dataset (with Tesseract as the OCR engine), without tuning the filters (we had used the filters individually before, with some brittle logic for deciding when to use them; the random filter approach was kind of a joke that ended up working). Tesseract on its own had an abysmal ~40% word-level accuracy on that same dataset. The dataset consisted of words segmented out of scanned bank statements.

Main downside was that this was pretty slow (we were generating ~10k filters per word), but we found a way to make it work (hierarchical stacked AWS Lambdas, lol).

7 comments

Selfless plug, but for some examples of these see a side project of mine: https://www.juusohaavisto.com/northern-nike-nabob.html
Or post-processing with domain-specific knowledge, e.g. https://www.taggun.io/ using Google Vision for receipts and then applying some ML, NLP, etc.
...and the result was better than what you could buy from Google Cloud Vision or ocr.space out of the box?
Interestingly, when you think about it, the filters in a convolutional neural net aren't too different from your setup, and they provide similar, if not better, performance.
I would be interested in exchanging notes; send me an email if you're up for it.
This is called an ensemble method.
c'mon -- that's the easy way!