

	by abc-1 399 days ago
	Anything that mentions tesseract is about 10 years out of date at this point.

4 comments

fxtentacle 398 days ago

Quite simply, you’re completely wrong. Modern tesseract versions include a modern LSTM AI. It can very affordably be deployed on CPU, yet its performance is competitive with much more expensive large GPU-based models. Especially if you handle a high volume of scans, chances are that tesseract will have the best bang per buck.
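For anyone who wants to try the LSTM engine on CPU, here is a minimal sketch that shells out to the `tesseract` CLI. The helper names (`build_tesseract_cmd`, `ocr_on_cpu`) and the file name `scan.png` are made up for illustration; `--oem 1` asks for the LSTM engine only, `--psm 3` is the default automatic page segmentation, and `-l` picks the language model. It assumes the `tesseract` binary and the relevant language data are already installed.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="eng", oem=1, psm=3):
    """Build the argv list for one CPU-only Tesseract run (hypothetical helper)."""
    return [
        "tesseract",
        str(image_path),
        "stdout",            # write recognized text to stdout instead of a file
        "--oem", str(oem),   # 1 = LSTM neural net engine only
        "--psm", str(psm),   # 3 = fully automatic page segmentation
        "-l", lang,          # language model, e.g. "eng"
    ]


def ocr_on_cpu(image_path, **kwargs):
    """Run Tesseract and return the recognized text (needs the binary on PATH)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,          # raise if tesseract exits with an error
    )
    return result.stdout
```

Calling `ocr_on_cpu("scan.png")` would return the page text as a string; for high volume you would probably fan this out across cores with a process pool.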

ianhawes 398 days ago

My company probably spent close to 6 figures overall creating Tesseract 5 custom models for various languages. Surya beats them all and is open source (and quite a bit faster).

booder1 398 days ago

Surya weights for the models are licensed cc-by-nc-sa-4.0. They have an exception for small companies. If your company is not small, you either need to pay them or use them illegally.

Their training code and data are closed source. They are barely open weight and only inference is open source.

nicman23 398 days ago

I remember that you could not train it yourself on a font like you could in older versions. Is that still the case?

booder1 399 days ago

5.5.0 released November last year. Still a very active project as far as I can tell and runs on CPU. Even compared to best open source GPU option it is still pretty good. VLMs work very differently and don't work as well for everything. Why is it out of date?

cbsmith 398 days ago

I don't know that that is true: https://researchify.io/blog/comparing-pytesseract-paddleocr-...

Using Surya gets you significantly better results and makes almost all the work detailed in the article largely unnecessary.

booder1 398 days ago

Surya weights for the models are licensed cc-by-nc-sa-4.0, so not free for commercial usage. Also, as far as I know, the training data is 100% unavailable. Given they use well trained, but standard models, it isn't really open source and barely, maybe, open weight. I kinda hate how their repo says GPL, because that is only true for the inference code. The training code is closed source.

cbsmith 397 days ago

I did not know that the training code is closed source. That is troubling.

amelius 399 days ago

Well, at least I can apt-get install tesseract-ocr.

That doesn't hold for any of the GPU-based solutions, last time I checked.

krapht 398 days ago

I just built a pipeline with tesseract last year. What's better that is open source and runnable locally?

VLM hallucination is a blocker for my use case.

criddell 398 days ago

If you are stuck with open source, then your options are limited.

Otherwise I'd say just use your operating system's OCR API. Both Windows and macOS have excellent APIs for this.

stavros 398 days ago

How is a hallucination worse than a Tesseract error?

krapht 398 days ago

Because the VLM doesn't know it hallucinated. When you get a Tesseract error you can flag the OCR job for manual review.
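One way the "flag it for manual review" step could look, as a rough sketch rather than anyone's actual pipeline: Tesseract reports a per-word confidence, and pytesseract exposes it through `image_to_data`. The function names and both thresholds (`min_conf=60.0`, `max_low_fraction=0.1`) are assumptions picked for illustration and would need tuning on real scans.

```python
def flag_for_review(ocr_data, min_conf=60.0, max_low_fraction=0.1):
    """Return (needs_review, low_conf_words) for one page of OCR output.

    ocr_data is a dict with parallel "text" and "conf" lists, the shape that
    pytesseract.image_to_data(..., output_type=Output.DICT) produces.
    The thresholds are illustrative assumptions, not recommended values.
    """
    words = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        # conf == -1 marks layout boxes (page, block, line) that hold no word
        if conf < 0 or not text.strip():
            continue
        words.append((text, conf))

    low = [(t, c) for t, c in words if c < min_conf]
    if not words:
        # A page with no recognized words is suspicious, so send it to a human
        return True, low
    return len(low) / len(words) > max_low_fraction, low


def ocr_and_flag(image_path):
    """Run Tesseract via pytesseract and apply the check above."""
    import pytesseract          # imported lazily so flag_for_review has no deps
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return flag_for_review(data)
```

A confidence score is not a guarantee of correctness, but it gives the job something to threshold on, which is the point being made about errors you can route to a reviewer.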

jgalt212 398 days ago

Hallucinations are hard to detect unless you are a subject-matter expert. I don't have direct experience with Tesseract error detection.

gessha 398 days ago

Latter is more likely to get debugged.

amelius 398 days ago

It could hallucinate obscene language, something which is less likely with classic OCR.