by abc-1 399 days ago
Anything that mentions tesseract is about 10 years out of date at this point.
4 comments

Quite simply, you’re completely wrong. Modern tesseract versions include a modern LSTM AI. It can very affordably be deployed on CPU, yet its performance is competitive with much more expensive large GPU-based models. Especially if you handle a high volume of scans, chances are that tesseract will have the best bang per buck.
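For anyone who wants to try the CPU-only LSTM path, here is a minimal sketch that shells out to the `tesseract` command line from Python. It assumes the `tesseract` binary and the `eng` language data are already installed; the helper names and the `scan.png` path are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build a CLI call that writes the recognized text to stdout.

    --oem 1 selects the LSTM engine only (no legacy engine).
    """
    return ["tesseract", image_path, "stdout", "--oem", "1", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image on the CPU and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


# Hypothetical usage ("scan.png" is a placeholder path):
#   text = ocr_image("scan.png")
```

Since each call is an independent process, a high volume of scans can be spread across CPU cores with an ordinary process or thread pool.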
My company probably spent close to 6 figures overall creating Tesseract 5 custom models for various languages. Surya beats them all and is open source (and quite a bit faster).
Surya weights for the models are licensed cc-by-nc-sa-4.0. They have an exception for small companies. If your company is not small you either need to pay them or use them illegally.

Their training code and data are closed source. They are barely open weight and only inference is open source.

I remember that you could not train it yourself on a font like you could in older versions. Is that still the case?
5.5.0 was released in November last year. Still a very active project as far as I can tell and runs on CPU. Even compared to the best open source GPU option it is still pretty good. VLMs work very differently and don't work as well for everything. Why is it out of date?
I don't know that that is true: https://researchify.io/blog/comparing-pytesseract-paddleocr-...

Using Surya gets you significantly better results and makes almost all the work detailed in the article largely unnecessary.

Surya weights for the models are licensed cc-by-nc-sa-4.0, so not free for commercial usage. Also, as far as I know, the training data is 100% unavailable. Given they use well-trained but standard models, it isn't really open source and barely, maybe, open weight. I kinda hate how their repo says GPL, because that is only true for the inference code. The training code is closed source.
I did not know that the training code is closed source. That is troubling.
Well, at least I can apt-get install tesseract-ocr.

That doesn't hold for any of the GPU-based solutions, last time I checked.

I just built a pipeline with tesseract last year. What's better that is open source and runnable locally?

VLM hallucination is a blocker for my use case.

If you are stuck with open source, then your options are limited.

Otherwise I'd say just use your operating system's OCR API. Both Windows and macOS have excellent APIs for this.

How is a hallucination worse than a Tesseract error?
Because the VLM doesn't know it hallucinated. When you get a Tesseract error you can flag the OCR job for manual review.
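One way the "flag for manual review" step could look, as a rough sketch: Tesseract reports a per-word confidence, so a job can be routed to a human when any word falls below a cutoff. The function names, the threshold of 60, and the `max_flagged` parameter are assumptions for illustration; the input is assumed to have the shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below threshold.

    `data` is assumed to hold parallel lists under "text" and "conf",
    where conf is -1 for rows that are not words (blocks, lines, ...).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions return strings, others numbers
        if conf < 0 or not text.strip():
            continue  # skip structural rows and empty tokens
        if conf < threshold:
            flagged.append((text, conf))
    return flagged


def needs_manual_review(data, threshold=60.0, max_flagged=0):
    """Flag the OCR job if more than max_flagged words look unreliable."""
    return len(low_confidence_words(data, threshold)) > max_flagged


# Hypothetical usage with pytesseract and Pillow installed:
#   import pytesseract
#   from PIL import Image
#   data = pytesseract.image_to_data(
#       Image.open("scan.png"), output_type=pytesseract.Output.DICT)
#   if needs_manual_review(data):
#       ...  # send the scan to a human
```

The confidence score is only a heuristic, so a confidently wrong read can still slip through; the cutoff would need tuning on your own scans.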
Hallucinations are hard to detect unless you are a subject-matter expert. I don't have direct experience with Tesseract error detection.
The latter is more likely to get debugged.
It could hallucinate obscene language, something which is less likely with classic OCR.