by cardigan
3146 days ago
A better baseline than an out-of-the-box OCR engine is to apply some computer vision techniques before running that same out-of-the-box engine. This can significantly outperform a pure OCR engine approach (speaking from experience working on startups doing this kind of stuff for ~10 months).

One dumb but surprisingly effective thing to try is to apply a bunch of random binarizing filters to your image before feeding it to the OCR engine, then pick the most common output across those filtered images. We used some combination of morphological operations (dilates/erodes in different orders and at different strengths), different levels of blur and sharpening, adaptive binary thresholds at different levels, resizing the image (hilarious, but some OCR engines are sensitive to relative scale), adding white borders of different thicknesses, rotating by small angles, denoising, ...

This gave us ~95% word-level accuracy on our dataset (with Tesseract as the OCR engine) without tuning the filters. We had previously used the filters individually, with some brittle logic for deciding when to apply each one; the random filter approach was kind of a joke that ended up working. Tesseract on its own had an abysmal ~40% word-level accuracy on that same dataset, which consisted of words segmented out of scanned bank statements.

The main downside was that this was pretty slow (we were generating ~10k filters per word), but we found a way to make it work (hierarchical stacked AWS Lambdas, lol).
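
For anyone who wants to play with the idea, here is a minimal sketch of the "random filters, then majority vote" loop. It is not the code described above. Everything in it is an assumption made for illustration: the function names (`random_step`, `ocr_by_vote`), the parameter ranges, and the tiny pure-Python filters, which stand in for what you would normally do with an image library. The `ocr` argument is a hypothetical callable that takes an image and returns a string; in practice it would wrap whatever engine you use, such as Tesseract.

```python
import random
from collections import Counter

# Images are grayscale lists of rows: 0 = black ink, 255 = white paper.

def threshold(img, level):
    """Binarize: pixels brighter than `level` become white, the rest black."""
    return [[255 if p > level else 0 for p in row] for row in img]

def add_border(img, width):
    """Pad the image with a white border `width` pixels thick."""
    blank = [255] * (len(img[0]) + 2 * width)
    pad = [255] * width
    top_bottom = [blank[:] for _ in range(width)]
    return top_bottom + [pad + row + pad for row in img] + [r[:] for r in top_bottom]

def upscale(img, factor):
    """Nearest-neighbour resize by an integer factor."""
    return [[p for p in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def morph(img, pick):
    """3x3 morphology: pick=min thickens dark strokes, pick=max thins them."""
    h, w = len(img), len(img[0])
    return [[pick(img[j][i]
                  for j in range(max(0, y - 1), min(h, y + 2))
                  for i in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]

def random_step(rng):
    """Draw one filter with random parameters (ranges are arbitrary guesses)."""
    kind = rng.choice(["threshold", "border", "upscale", "thicken", "thin"])
    if kind == "threshold":
        level = rng.randint(64, 192)
        return lambda im: threshold(im, level)
    if kind == "border":
        width = rng.randint(1, 4)
        return lambda im: add_border(im, width)
    if kind == "upscale":
        factor = rng.randint(2, 3)
        return lambda im: upscale(im, factor)
    pick = min if kind == "thicken" else max
    return lambda im: morph(im, pick)

def ocr_by_vote(img, ocr, n_variants=50, max_steps=3, seed=0):
    """Run `ocr` on randomly filtered copies of `img`; return the most common text."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_variants):
        variant = img
        for _ in range(rng.randint(1, max_steps)):
            variant = random_step(rng)(variant)
        text = ocr(variant).strip()
        if text:  # ignore variants where the engine read nothing
            votes[text] += 1
    return votes.most_common(1)[0][0] if votes else ""
```

You would call it as `ocr_by_vote(word_image, ocr=my_engine)`, where `my_engine` is your own wrapper. A real version would add blur, sharpening, small rotations, and denoising to the filter pool, and would run the variants in parallel, since the cost grows linearly with `n_variants`.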