by cfcef 3843 days ago
I hope not. Tesseract delivers bad results even on high-quality scans, far below the OCR quality achieved by services like Google Books.

What the OCR market needs is someone who will bring that level of OCR quality - or better - to the masses (perhaps some deep learning grad student with time to kill?), not yet another wrapper around Tesseract. We have those already!

3 comments

Have you looked into ocropy[0]?

Here's a nice intro[1] that later talks about how it achieves higher accuracy using an LSTM model[2].

[0] https://github.com/tmbdev/ocropy

[1] http://www.danvk.org/2015/01/09/extracting-text-from-an-imag...

[2] http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-mode...

I have not. It sounds interesting but raw and unsuitable for end-users. I hope the quality improves and they can get it packaged up in a way that existing document scanners can plug into easily.
Note that the primary author of ocropy (formerly ocropus) works at Google.
Is the problem really Tesseract, or the fact that it doesn't have a robust front end performing segmentation, de-skewing, better binarization, etc.? I've heard that Google Books actually uses the Tesseract engine but gets better results, partly from better training but mostly from a more advanced system that breaks each page into the blocks of text that are actually OCRed.
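To make the binarization part of that front end concrete, here is a minimal sketch of Otsu's method in plain Python. This is only an illustration of one common global thresholding technique; I am not claiming it is what Tesseract or Google Books uses, and the function names (`otsu_threshold`, `binarize`) are made up for this example. It assumes an 8-bit grayscale image given as a list of rows of integers in 0..255.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes the between-class
    variance when pixels <= t form one class and pixels > t the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image, threshold):
    """Map each pixel to 1 (ink, at or below the threshold) or 0 (paper)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 3x3 "scan": bright paper with a short dark vertical stroke.
scan = [
    [200, 200, 200],
    [200, 10, 200],
    [200, 12, 200],
]
t = otsu_threshold([p for row in scan for p in row])
mask = binarize(scan, t)
```

A single global threshold like this tends to fail on unevenly lit pages, which is presumably where "better binarization" (for example, locally adaptive thresholds) would come in.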
I have had great results using Tesseract via gImageReader. Are you sure your configuration is good?
Would it be possible to upload an example image and the result?