HN Mirror



by cfcef 3843 days ago

I hope not. Tesseract delivers poor results even on high-quality scans, far below the OCR quality achieved by services like Google Books. What the OCR market needs is someone who will bring that level of OCR quality, or better, to the masses (perhaps some deep learning grad student with time to kill?), not yet another wrapper around Tesseract. We have those already!

3 comments

jiaweihli 3843 days ago

Have you looked into ocropy[0]?

Here's a nice intro[1] that later talks about how it achieves higher accuracy using an LSTM model[2].

[0] https://github.com/tmbdev/ocropy

[1] http://www.danvk.org/2015/01/09/extracting-text-from-an-imag...

[2] http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-mode...


cfcef 3843 days ago

I have not. It sounds interesting but raw and unsuitable for end-users. I hope the quality improves and they can get it packaged up in a way that existing document scanners can plug into easily.


jahewson 3843 days ago

Note that the primary author of ocropy (formerly ocropus) works at Google.


acdha 3843 days ago

Is the problem really Tesseract, or is it the lack of a robust front-end performing segmentation, de-skewing, better binarization, etc.? I've heard that Google Books actually uses the Tesseract engine but sees better results, in part from better training but mostly from a more advanced system that breaks each page into the blocks of text which are actually OCRed.
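
To make the binarization step concrete, here is a minimal sketch of one common approach, Otsu's global threshold, in plain Python. This is only an illustration of the kind of preprocessing meant above, not what Tesseract or Google Books actually does; the function names `otsu_threshold` and `binarize` are made up for this example, and the input is assumed to be 8-bit grayscale values (0 to 255).

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    `pixels` is a flat list of 8-bit grayscale values (an assumption of
    this sketch). Values <= t are treated as the dark class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of those pixel values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet, nothing to split
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a 2D list of grayscale rows to 0/1, where 1 marks dark (ink) pixels."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold like this tends to struggle with uneven lighting or stained pages, which is one reason a more elaborate front-end (local or adaptive thresholding, de-skewing, block segmentation) can matter more than the recognition engine itself.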


mynewtb 3843 days ago

I have had great results using Tesseract via gImageReader. Are you sure your configuration is good?


random778 3843 days ago

Possible to upload an example image + result?
