Hacker News
by acdha 3843 days ago
Is the problem really Tesseract itself, or the fact that it doesn't have a robust front-end performing segmentation, de-skewing, better binarization, etc.? I've heard that Google Books actually uses the Tesseract engine but gets better results, partly from better training but mostly from a more advanced system that breaks each page into the blocks of text which actually get OCRed.
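
To make the "front-end" idea concrete, here is a rough sketch of two of those steps in plain Python: Otsu binarization, then splitting the page into horizontal bands of ink using a row projection. This is only an illustration under simple assumptions (dark text on a light page, no skew, grayscale values 0-255 in a list of lists). It is not what Tesseract or Google Books actually do, and the function names are made up for the example.

```python
# Illustrative sketch only; all names here are hypothetical and this is
# not Tesseract's (or Google Books') real preprocessing pipeline.

def otsu_threshold(gray):
    """Pick a threshold for a 2D list of 0-255 values using Otsu's method."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0      # number of pixels with value <= t
    sum_low = 0    # sum of those pixel values
    for t in range(256):
        w_low += hist[t]
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        sum_low += t * hist[t]
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance (up to a constant factor).
        var = w_low * w_high * (mean_low - mean_high) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray, threshold):
    """Mark pixels at or below the threshold as ink (1), the rest as 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


def split_into_bands(binary):
    """Return (start, end) row ranges, end exclusive, that contain any ink."""
    bands, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(binary)))
    return bands


# Tiny synthetic "page": 200 is paper, 20 is ink.
page = [
    [200, 200, 200, 200],
    [200, 20, 20, 200],
    [200, 200, 200, 200],
    [20, 20, 200, 200],
    [20, 200, 200, 200],
    [200, 200, 200, 200],
]
bands = split_into_bands(binarize(page, otsu_threshold(page)))
```

Each band would then be cropped and handed to the OCR engine separately. A real front-end would also need de-skewing and column/block detection, which this sketch leaves out entirely.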