HN Mirror


by ocrcustomserver 2764 days ago

In tesseract, if you want to recognize both English and German in the same image, you can pass the option -l deu+eng.
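For anyone scripting this, here is a minimal sketch of how the combined -l value could be assembled. The helper name build_tesseract_args and the file name scan.png are made up for illustration; the only tesseract behaviour relied on is the "+" separator described above and "stdout" as the output base.

```python
def build_tesseract_args(image_path, langs):
    """Build an argument list such as: tesseract scan.png stdout -l deu+eng

    build_tesseract_args is a hypothetical helper, not part of tesseract.
    """
    # Tesseract accepts several languages joined with "+".
    lang_arg = "+".join(langs)
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", image_path, "stdout", "-l", lang_arg]


if __name__ == "__main__":
    # Prints the argument list; pass it to subprocess.run to actually run OCR.
    print(build_tesseract_args("scan.png", ["deu", "eng"]))
```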

If you want to perform language detection you can do the following:

a. Invoke tesseract with "-l eng".

b. Pass the output text to langdetect [1]. It is a port of Google's language detection library to Python which will give you the probabilities of the languages for a given text.

c. Invoke tesseract again with "-l" set to the language that langdetect reported as most likely.

Note that langdetect returns 2-character codes (ISO 639-1), whereas tesseract expects 3-character codes (ISO 639-2), so the detected code has to be mapped first, e.g. "de" to "deu".
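Steps a to c could be sketched roughly as follows. This assumes the tesseract binary is on the PATH and that langdetect is installed (pip install langdetect). The mapping table is a small hand-written sample rather than a complete list, and the function names are invented for this example.

```python
import subprocess

# Partial, hand-written mapping from ISO 639-1 codes (langdetect) to the
# 3-letter codes tesseract expects. Extend it for the languages you need.
ISO1_TO_TESS = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa", "it": "ita"}


def to_tesseract_code(iso1, default="eng"):
    """Translate a langdetect code; fall back to `default` if it is unmapped."""
    return ISO1_TO_TESS.get(iso1, default)


def run_tesseract(image_path, lang):
    """Run tesseract on the image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_with_detection(image_path):
    # Imported here so the helpers above work without langdetect installed.
    from langdetect import detect_langs

    first_pass = run_tesseract(image_path, "eng")   # step a
    candidates = detect_langs(first_pass)           # step b: most probable first
    best = candidates[0].lang
    return run_tesseract(image_path, to_tesseract_code(best))  # step c
```

If the first pass yields no usable text, detect_langs raises an exception, so a real script would probably want to catch that and keep the "eng" output.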

[1]: https://github.com/Mimino666/langdetect

1 comment

gingerlime 2762 days ago

Thanks. Wasn't aware it is possible to combine languages!
