Hacker News
by redox_ 4592 days ago
You should also check whole, unambiguous words before falling back to trigrams. "marché" exists only in French, whereas "mar", "arc", ... occur in lots of languages. This should drastically improve your results.
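
A rough sketch of the idea in Python, in case it helps. The lexicon below is a made-up toy (only "marché" comes from the comment above; the other entries and the language codes are assumptions for illustration), and `trigram_guesser` stands in for whatever trigram model you already have:

```python
# Hypothetical toy lexicon: words assumed to occur in only one language.
# A real one would be derived from per-language word lists.
UNAMBIGUOUS_WORDS = {
    "marché": "fr",
    "although": "en",
    "straße": "de",
}

def guess_by_words(text):
    """Return the language with the most unambiguous-word hits, or None."""
    votes = {}
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        lang = UNAMBIGUOUS_WORDS.get(word)
        if lang is not None:
            votes[lang] = votes.get(lang, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)

def guess_language(text, trigram_guesser):
    """Try the word lexicon first; fall back to the trigram model."""
    return guess_by_words(text) or trigram_guesser(text)
```

The fallback only runs when no lexicon word is found, so the trigram model still handles everything the word list misses.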

Store only the top N most common unambiguous words if RAM consumption matters ;)
Or store the lexicon in a deterministic acyclic finite state automaton. E.g. (shameless plug):

https://github.com/danieldk/dictomaton

Though, having implemented a language guesser myself, I found it's only an issue with very short texts (a few words). On longer texts, models based on character n-grams achieve very high accuracy.
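
For reference, a minimal character-trigram guesser looks something like the sketch below. It is not the guesser mentioned above, just an add-one smoothed bag-of-trigrams score; the function names and the idea of training from one sample text per language are assumptions for illustration:

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """samples: {lang: training text}. Returns {lang: Counter of trigrams}."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

def score(text, profile, vocab_size):
    """Log-likelihood of the text's trigrams under an add-one smoothed profile."""
    total = sum(profile.values())
    return sum(
        math.log((profile[gram] + 1) / (total + vocab_size))
        for gram in trigrams(text)
    )

def guess(text, profiles):
    """Pick the language whose profile gives the text the highest score."""
    vocab = set().union(*profiles.values())  # all trigrams seen in training
    return max(profiles, key=lambda lang: score(text, profiles[lang], len(vocab)))
```

With only a handful of trigrams in the input, a few shared ones can dominate the score, which is where the short-text problem comes from.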