by piotrrojek 1088 days ago
What's the best OCR currently that I could deploy myself? Last time I tried Tesseract 5 I wasn't too impressed.

It depends on your setup and use cases. There are three major considerations:

* What language are you trying to OCR? Is it only language, or also things like math symbols?
* Do you have a GPU or not?
* Are you trying to OCR handwriting or typed words?

I explored OCRing English documents from the 1960s that were primarily typed, though with some handwriting. I tried out PaddleOCR, TrOCR, Tesseract, EasyOCR, and kerasOCR for FOSS, and then Google, Amazon, and Microsoft for paid.

To be clear, the paid solutions beat the FOSS ones hands down, no question. Among the FOSS options, I found that TrOCR was the best for both typed and handwritten text. For typed text it was closely followed by Tesseract, but for handwriting TrOCR was by far the best, with all the others basically being worthless. However, TrOCR took ~200x longer even on GPU than Tesseract on CPU (Tesseract is fast, even more so if you parallelize it). Tesseract isn't the best, but it's the best all-around, and it's the one the Internet Archive uses.
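For anyone wondering what "parallelize it" could look like, here is a minimal sketch, not the setup used for the comparison above. It assumes the `tesseract` command-line binary is installed and on your PATH, and that each page is already a separate image file. The helper names `run_tesseract` and `ocr_pages` are made up for illustration. Threads are enough here because each worker just waits on an external `tesseract` process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path, lang="eng"):
    """OCR one image by shelling out to the tesseract CLI (assumed installed)."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing a .txt file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages(image_paths, ocr=run_tesseract, workers=4):
    """OCR many pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of image_paths in its results.
        return list(pool.map(ocr, image_paths))
```

Calling `ocr_pages(["page1.png", "page2.png"])` would return one string per page. The `ocr` parameter is only there so you can swap in a different engine (or a stub for testing) without touching the pooling logic. A reasonable starting point for `workers` is the number of CPU cores, but that is a guess to tune, not a measured recommendation.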

I need to write up a blog post on this. docTR also looks interesting; I'm going to check that out.

EasyOCR is a popular project if you are in an environment where you can run Python and PyTorch (https://github.com/JaidedAI/EasyOCR). Other open source projects of note are PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) and docTR (https://github.com/mindee/doctr).
FWIW, I just tried EasyOCR on some sans-serif text and Tesseract 5 absolutely blew it out of the water. The only thing Tesseract got wrong that EasyOCR (sometimes) got right was the uppercase "I", which Tesseract recognized pretty much 100% of the time as a vertical bar ("|"). Since my text of interest is extremely unlikely to have any vertical bar characters, a simple sed post-processing stage fixed that.
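The exact sed command isn't given here, but the idea is a blanket substitution of every vertical bar with a capital "I". A rough Python equivalent, with the hypothetical helper name `fix_vertical_bars`, might look like this. It is only safe under the same assumption as above: the text never contains real "|" characters.

```python
def fix_vertical_bars(text):
    """Replace every vertical bar with an uppercase I.

    Assumes the source text has no legitimate "|" characters,
    so each one in the OCR output is a misread "I".
    """
    return text.replace("|", "I")
```

For example, `fix_vertical_bars("| think | can")` gives back `"I think I can"`.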

- Tesseract 5 *demolished* EasyOCR on paragraph detection, getting it 100% right on the 10 pages I checked. EasyOCR missed most of the paragraph breaks.

- Tesseract got most of the punctuation correct; EasyOCR only got apostrophes and two double-quotes (out of 14) correct. Every single period, comma, exclamation mark, and hyphen was missing or wrong, as were most of the double-quotes. Some question marks were recognized, but with garbage after them.

- In general EasyOCR seems to just add in square closing brackets ("]") where none are