It depends on your setup and use case. There are three major considerations:
* What language are you trying to OCR? Is it only natural language, or also things like math symbols?
* Do you have a GPU or not?
* Are you trying to OCR handwriting or typed words?
I explored OCRing English documents from the 1960s that were primarily typed, though with some handwriting. I tried out PaddleOCR, TrOCR, Tesseract, EasyOCR, and kerasOCR for FOSS, and then Google, Amazon, and Microsoft for paid.
To be clear, the paid solutions beat the FOSS ones hands down, no question. Among the FOSS options, I found that TrOCR was the best for both typed and handwritten text. For typed text it was closely followed by Tesseract, but for handwriting TrOCR was by far the best, with all the others basically being worthless. However, TrOCR took ~200x longer, even on GPU, than Tesseract on CPU (Tesseract is fast, even more so if you parallelize it). Tesseract isn't the best, but it's the best all-around; it's the one the Internet Archive uses.
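On parallelizing Tesseract: since each page is independent, the simplest approach is one `tesseract` process per page, several at a time. Here is a minimal Python sketch of that idea. It assumes the `tesseract` CLI is on your PATH and that the pages are PNG files in a `scans/` directory; the helper names `ocr_page` and `ocr_pages`, the directory, and the worker count are made up for illustration and are not the setup used for the timings above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_page(image_path, lang="eng"):
    """Run the tesseract CLI on one image and return the recognized text."""
    # "stdout" as the output base tells tesseract to print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_pages(image_paths, workers=4, ocr=ocr_page):
    """OCR many pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, image_paths))


if __name__ == "__main__":
    # Hypothetical layout: scans/page-001.png, scans/page-002.png, ...
    pages = sorted(Path("scans").glob("*.png"))
    for path, text in zip(pages, ocr_pages(pages)):
        path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Threads are enough here because the heavy lifting happens in the separate `tesseract` processes; Python just waits on them. A reasonable starting point for `workers` is the number of CPU cores, tuned from there.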
I need to write up a blog post on this. docTR also looks interesting; I'm going to check that out.
FWIW, I just tried EasyOCR on some sans-serif text and Tesseract5 absolutely blew it out of the water. The only thing Tesseract got wrong that EasyOCR (sometimes) got right was the uppercase "I", which Tesseract recognized as a vertical bar ("|") pretty much 100% of the time. Since my text of interest is extremely unlikely to contain any vertical bar characters, a simple sed post-processing stage fixed that.
- Tesseract5 demolished EasyOCR on paragraph detection, getting it 100% right on the 10 pages I checked. EasyOCR missed most of the paragraph breaks.
- Tesseract got most of the punctuation correct; EasyOCR only got apostrophes and two double-quotes (out of 14) correct. Every single period, comma, exclamation mark, and hyphen was missing or wrong, as were most of the double-quotes. Some question marks were recognized, but with garbage after them.
- In general EasyOCR seems to just add in square closing brackets ("]") where none are
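
The vertical-bar fix mentioned above is a one-line substitution. The original used sed; the following is a rough Python equivalent, a sketch that only makes sense under the same assumption that the source text never legitimately contains "|". The function name `fix_vertical_bars` is hypothetical.

```python
def fix_vertical_bars(text):
    """Replace every vertical bar with an uppercase I.

    Assumption: the documents being OCRed never contain a real "|",
    so every occurrence is a misread capital I.
    """
    return text.replace("|", "I")


if __name__ == "__main__":
    print(fix_vertical_bars("| think | saw it."))
```

If the text might contain real vertical bars (tables, code, math), a blanket replacement like this would corrupt them, so it is only appropriate for plain prose.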