| HN Mirror



	by redox_ 4592 days ago
	You should also consider matching full, unambiguous words before falling back to trigrams. "marché" occurs only in French, whereas "mar", "arc", ... occur in lots of languages. This should drastically improve your results.
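A rough sketch of that order of operations, in Python: check whole words against a lexicon of unambiguous words first, and only score trigrams when no word matches. The lexicon entries, the trigram counts, and the name `guess_language` are all made up for illustration; a real guesser would derive them from corpora.

```python
from collections import Counter

# Hypothetical toy lexicon: words assumed to occur in only one language.
UNAMBIGUOUS_WORDS = {
    "marché": "fr",
    "weather": "en",
    "straße": "de",
}

# Hypothetical toy trigram profiles (counts are invented, not measured).
TRIGRAM_PROFILES = {
    "fr": Counter({"mar": 3, "arc": 2, "che": 4}),
    "en": Counter({"mar": 2, "arc": 3, "the": 5}),
}


def trigrams(text):
    """Return all overlapping 3-character substrings of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def guess_language(text):
    """Try unambiguous whole words first, then fall back to trigram counts."""
    text = text.lower()

    # Step 1: each unambiguous word found casts one vote for its language.
    votes = Counter()
    for word in text.split():
        lang = UNAMBIGUOUS_WORDS.get(word)
        if lang is not None:
            votes[lang] += 1
    if votes:
        return votes.most_common(1)[0][0]

    # Step 2: no unambiguous word matched, so sum trigram counts per language.
    scores = Counter()
    for tri in trigrams(text):
        for lang, profile in TRIGRAM_PROFILES.items():
            scores[lang] += profile[tri]
    if not scores or max(scores.values()) == 0:
        return None  # nothing matched at all
    return scores.most_common(1)[0][0]
```

With this toy data, "le marché" is decided by the word lookup alone, while a bare "arc" has to go through the ambiguous trigram scores.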


redox_ 4592 days ago

Store only the N most common unambiguous words if RAM consumption matters ;)
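One way this pruning could look, as a sketch: given per-language word frequencies, drop every word seen in more than one language, then keep the N most frequent survivors per language. The function name `top_unambiguous` and the input shape (a dict of language to word counts) are assumptions for illustration.

```python
from collections import Counter


def top_unambiguous(freqs_by_lang, n):
    """Build a word -> language lexicon from the n most frequent words
    per language that appear in no other language's word list.

    freqs_by_lang: dict mapping language code -> dict of word -> count
    (an assumed input shape, not taken from any particular library).
    """
    # Count in how many languages each word occurs.
    langs_per_word = Counter()
    for freqs in freqs_by_lang.values():
        langs_per_word.update(freqs.keys())

    lexicon = {}
    for lang, freqs in freqs_by_lang.items():
        # Keep only words that belong to this language alone.
        unique = Counter(
            {word: count for word, count in freqs.items()
             if langs_per_word[word] == 1}
        )
        for word, _count in unique.most_common(n):
            lexicon[word] = lang
    return lexicon
```

Memory then grows with N times the number of languages rather than with the full vocabulary, at the cost of missing rarer unambiguous words.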


microtonal 4592 days ago

Or store the lexicon in a deterministic acyclic finite state automaton. E.g. (shameless plug):

https://github.com/danieldk/dictomaton

Though, having implemented a language guesser myself, I found that this is only an issue with very short texts (a few words). On longer texts, models based on character n-grams achieve very high accuracy.
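For reference, a minimal sketch of a character n-gram guesser in Python. It is not the implementation mentioned above. It assumes one training text per language, counts n-grams, and picks the language with the highest add-one-smoothed log-probability; the smoothing choice and the names `train` and `classify` are illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples: dict of language -> training text. Returns n-gram counts."""
    return {lang: Counter(char_ngrams(text, n))
            for lang, text in samples.items()}


def classify(text, models, n=3):
    """Return the language whose model gives text the highest log-score."""
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # crude add-one smoothing (an assumption)
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

A short input contributes only a handful of n-grams to the sum, so a single ambiguous trigram can swing the result; longer inputs average that noise out.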
