by sandreas 953 days ago
This. I stumbled over the same problem and did not find the preprocessing too hard.

I achieved pretty good results with a few simple steps before using tesseract:

- Sauvola adaptive thresholding (there are many better algorithms today, but Sauvola is still pretty good)

- Creating histogram-based patches to analyse which parts are text and which parts are images (similar to JBIG2)
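
For anyone who wants to try the first step, here is a rough sketch of Sauvola thresholding in plain Python. It is not the code from my lost project. The function name `sauvola_binarize`, the list-of-rows image format, and the defaults `window=15`, `k=0.2`, `R=128` are assumptions chosen for illustration. The threshold per pixel is `mean * (1 + k * (std / R - 1))`, computed over a local window using integral images.

```python
import math

def sauvola_binarize(img, window=15, k=0.2, R=128.0):
    """Binarize a grayscale image (list of rows, values 0..255).

    Returns a same-sized grid where 1 marks ink (dark) and 0 marks background.
    window, k and R are illustrative defaults, not values from the comment.
    """
    h, w = len(img), len(img[0])
    # Integral images of values and squared values, padded with a zero border.
    S = [[0] * (w + 1) for _ in range(h + 1)]
    S2 = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        row_sq = 0
        for x in range(w):
            v = img[y][x]
            row_sum += v
            row_sq += v * v
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
            S2[y + 1][x + 1] = S2[y][x + 1] + row_sq

    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            # Window is clamped at the image borders.
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            n = (y1 - y0) * (x1 - x0)
            s = S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]
            s2 = S2[y1][x1] - S2[y0][x1] - S2[y1][x0] + S2[y0][x0]
            mean = s / n
            std = math.sqrt(max(0.0, s2 / n - mean * mean))
            # Sauvola's local threshold.
            t = mean * (1.0 + k * (std / R - 1.0))
            out[y][x] = 1 if img[y][x] <= t else 0
    return out
```

The integral images keep the cost per pixel constant regardless of window size, which matters on full-page scans. In practice you would probably use a library implementation rather than nested Python loops.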

I once even found a paper with an algorithm for detecting text-line slopes on geographical maps. It was simple, fast and pure genius for curved text lines, so I implemented a pixel mapper to straighten them. Unfortunately the whole project got lost somewhere on the NAS. Maybe I still have it somewhere, but Java was not the best language to implement this :-)

However, even though I found a simple solution for some of my use cases, I think the whole OCR topic is pretty hard to generalize. Algorithms that work for specific use cases in specific countries don't work for others. And it is a lot of hard work to cover all the fonts, typography, edge cases and performance problems in one piece of software.