by philsnow 2758 days ago
If you're running tesseract locally (i.e. not paying per invocation), run it once with EN and count occurrences of the/this/a/any etc., then run it again with DE and count occurrences of der/die/das/um/ab/wie, and go from there?

Edit: Hell, even average word length is probably going to be a good indicator, since German builds so many long compound words. Collect a few features like this and I think you'll be able to build a pretty good classifier.
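A rough sketch of that heuristic in Python, assuming you already have the two OCR outputs as strings (one from the EN run, one from the DE run). The word lists are just the examples from the comment, and the function names are made up for illustration; a real classifier would want longer lists and probably the word-length feature as well.

```python
import re

# Marker words taken from the comment above; extend these for real use.
EN_STOPWORDS = {"the", "this", "a", "any"}
DE_STOPWORDS = {"der", "die", "das", "um", "ab", "wie"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stopword_hits(text, stopwords):
    """Count how many tokens in the text are in the given stopword set."""
    return sum(1 for word in tokenize(text) if word in stopwords)


def average_word_length(text):
    """Mean token length; a possible extra feature, 0.0 for empty input."""
    words = tokenize(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def guess_language(en_ocr_text, de_ocr_text):
    """Compare English hits in the EN run against German hits in the DE run."""
    en_hits = stopword_hits(en_ocr_text, EN_STOPWORDS)
    de_hits = stopword_hits(de_ocr_text, DE_STOPWORDS)
    if en_hits == de_hits:
        return "unknown"
    return "en" if en_hits > de_hits else "de"
```

Ties come back as "unknown" rather than a guess, which is where something like average_word_length could act as a tiebreaker.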

1 comment

Good idea. I would take it one step further and use an ML-based language detection tool, which should return a list of candidate languages, each with a confidence score. Whichever language has the highest confidence score wins. The fastText project has a good pre-trained language identification model available.
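Something like the following is what I have in mind, as a sketch only. It assumes the fasttext Python package is installed and that you have downloaded one of the fastText language identification models; the default path "lid.176.ftz" is an assumption about where you saved it, and the function names are mine, not part of fastText.

```python
def pick_best(predictions):
    """Return the (language, confidence) pair with the highest confidence."""
    return max(predictions, key=lambda pair: pair[1])


def detect_language(text, model_path="lid.176.ftz", k=3):
    """Ask a fastText language ID model for its top-k guesses and keep the best.

    model_path is an assumed local path to a downloaded model file.
    """
    import fasttext  # imported lazily so pick_best works without the package

    model = fasttext.load_model(model_path)
    # predict() handles one line at a time, so newlines are flattened first.
    labels, scores = model.predict(text.replace("\n", " "), k=k)
    candidates = [
        (label.replace("__label__", ""), float(score))
        for label, score in zip(labels, scores)
    ]
    return pick_best(candidates)
```

In practice you would load the model once and reuse it rather than reloading on every call, and you may want a minimum confidence threshold below which you fall back to the stopword counts.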