by gingerlime 2757 days ago
I have a personal flow using tesseract to scan docs into searchable PDFs, but it’s not that accurate. One of the main problems is that some (now most?) of the documents are in German since I live in Germany, but some are in English. There’s a way to choose the language but nothing to auto detect as far as I’m aware. I was hoping for some cloud AI service with superior OCR and simple integration or CLI (push a PDF and download one with OCR embedded). Google seems to be too complicated unfortunately... Any tips??
4 comments

If you're running tesseract locally (i.e. not paying per invocation), run it once with EN and count occurrences of the/this/a/any etc, run it again with DE and count occurrences of der/die/das/um/ab/wie, and go from there?

Edit: Hell, even average word length is probably going to be a good indicator, since German builds such long compound words. Collect a few features like this and I think you'll be able to build a pretty good classifier.
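A minimal sketch of that counting idea, assuming you already have the OCR output as a string. The word lists are small hand-picked samples and the function names are made up for illustration; this is not a tuned classifier.

```python
import re

# Hand-picked sample stop words (an assumption; extend for real documents).
EN_WORDS = {"the", "this", "a", "any", "and", "of", "is"}
DE_WORDS = {"der", "die", "das", "um", "ab", "wie", "und", "ist"}


def tokenize(text):
    """Lower-case the text and return runs of letters (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def language_scores(text):
    """Count stop-word hits per language and compute the average word length."""
    words = tokenize(text)
    en_hits = sum(w in EN_WORDS for w in words)
    de_hits = sum(w in DE_WORDS for w in words)
    avg_len = sum(map(len, words)) / len(words) if words else 0.0
    return {"eng": en_hits, "deu": de_hits, "avg_word_len": avg_len}


def guess_language(text):
    """Return 'eng' or 'deu', or None when the counts are tied."""
    scores = language_scores(text)
    if scores["eng"] == scores["deu"]:
        return None
    return "eng" if scores["eng"] > scores["deu"] else "deu"
```

You would feed it the text from each tesseract run and keep whichever language scores higher. The average word length is returned but not used in the decision here; it could serve as a tie-breaker.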

Good idea. I would take it one step further. I would use an ML-based language detection tool which should return a list of languages and a confidence score. Whichever language has the highest confidence score wins. The FastText project has a good pre-trained model available.
In tesseract, if you want to recognize both English and German you can use option -l deu+eng.

If you want to perform language detection you can do the following:

a. Invoke tesseract with "-l eng".

b. Pass the output text to langdetect [1]. It is a port of Google's language detection library to Python which will give you the probabilities of the languages for a given text.

c. Invoke tesseract with "-l langdetect_output"

Note that langdetect generates 2 character codes (ISO 639-1, e.g. "de") whereas tesseract expects 3 character codes (ISO 639-2, e.g. "deu"), so you need to map one to the other.

[1]: https://github.com/Mimino666/langdetect
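
Steps a to c could be sketched roughly like this, assuming the tesseract binary is on your PATH with the eng and deu language data installed, and langdetect installed via pip. The helper names and the two-entry mapping table are hypothetical; the fallback to "deu+eng" is my own choice, not something tesseract or langdetect prescribes.

```python
import subprocess

# Assumed mapping from langdetect's ISO 639-1 codes to tesseract's
# 3-letter codes; only the two languages from this thread are listed.
ISO_639_1_TO_TESSERACT = {"en": "eng", "de": "deu"}


def to_tesseract_lang(code, default="deu+eng"):
    """Map a 2-letter code to a tesseract code, else fall back to both."""
    return ISO_639_1_TO_TESSERACT.get(code.lower(), default)


def run_tesseract(image_path, lang):
    """Run the tesseract CLI and return the recognized text from stdout."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_with_detected_language(image_path):
    """First pass with eng, detect the language, second pass with it."""
    # Imported here so the mapping helper works without langdetect installed.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    first_pass = run_tesseract(image_path, "eng")
    try:
        lang = to_tesseract_lang(detect(first_pass))
    except LangDetectException:
        # Raised when the text has no usable features (e.g. blank page).
        lang = "deu+eng"
    if lang == "eng":
        return first_pass  # the first pass already used the right language
    return run_tesseract(image_path, lang)
```

Usage would be something like ocr_with_detected_language("scan.png"), where the file name is just a placeholder.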

Thanks. Wasn't aware it is possible to combine languages!
If you don't absolutely need the integration/CLI, I recommend FineReader (Standard edition). You can specify that the document can contain text from a set of languages (e.g., German and English) and it will auto-detect appropriately. If you need automation (of import, processing, export), this can be done with FineReader Server (formerly known as Recognition Server), but the pricing is quite high for personal use. FineReader Corporate edition has limited automation -- if sufficient for your needs, the pricing might be much more reasonable. I have used the Standard edition and Recognition Server extensively, but have not used the Corporate edition. If you really want a cloud service, you can make your own with their Cloud SDK or use their FineReader Online, but I also have no experience with these.

As for accuracy, the details of your documents and scanning can matter, but, for normal personal usage, it should be very high.

I've heard good things about FineReader, but I'm using Linux and it doesn't look like it's available there. I'd also want to automate the scanning workflow, and I can't really justify spending that much on it.
There's ABBYY FineReader Engine CLI for Linux: https://www.ocr4linux.com/
You can try the free ocr api at https://ocr.space/ocrapi
Looks interesting, but the free tier's limits are too restrictive unfortunately (3 pages, 1 MB), and I cannot justify paying this much for the paid option when I probably scan fewer than 10 documents per month (which can be longer than 3 pages and larger than 1 MB).