|
Google OCR is definitely not the same as Tesseract, although it's true that Tesseract is maintained by Google. Google OCR is much more accurate and significantly faster (inference basically always takes 1s, while Tesseract can easily take 10s or more on dense pages). Source: I work on a competing OCR service and we keep an eye on the competition (aside from Google: solutions by Azure, Amazon, Abbyy, Nuance, Cloudmersive, etc., as well as our internal product of course, which is not available externally), and they are (almost) all significantly better than Tesseract. The only domain where Tesseract is competitive is perfect "black text on white paper"; it performs pretty poorly on colored or distorted text, or even on strong page structure effects (tables, etc.). When I say "pretty poorly" I mean "with respect to the state of the art". Of course it's still enormously better than what was the state of the art before deep learning came into the picture, roughly a decade ago. And for things like "search contents of a book" it's basically perfect already. |
Great. How do you quantify it and keep track? Is there an industry standard benchmark?
Would you consider sharing a Backblaze-style analysis (they track consumer hard drive performance, and blogging about it got them a lot of attention and customers)?