

by andreasvc 4273 days ago

I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection a simple unigram model would be sufficient; i.e., just choose the language whose corpus covers the largest share of the word tokens in the input (maximum recall). I think this would work better on short sentences as well.
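A minimal sketch of the unigram idea described above, assuming one plain-text corpus per language. The names `tokenize`, `build_vocabularies`, and `detect_language` are hypothetical and only illustrate the approach; the regex tokenizer is a simplification that will not work for languages written without spaces between words.

```python
import re


def tokenize(text):
    # Lowercased runs of word characters (a simplifying assumption).
    return re.findall(r"\w+", text.lower())


def build_vocabularies(corpora):
    # corpora: mapping of language name -> raw corpus text (hypothetical input).
    return {lang: set(tokenize(text)) for lang, text in corpora.items()}


def detect_language(text, vocabularies):
    tokens = tokenize(text)
    if not tokens:
        return None  # nothing to score

    def recall(vocab):
        # Fraction of input tokens that occur in this language's vocabulary.
        return sum(token in vocab for token in tokens) / len(tokens)

    # Ties go to whichever language comes first in the mapping.
    return max(vocabularies, key=lambda lang: recall(vocabularies[lang]))


corpora = {
    "en": "the cat sat on the mat and the dog ran",
    "nl": "de kat zat op de mat en de hond liep",
}
vocabularies = build_vocabularies(corpora)
print(detect_language("the dog sat", vocabularies))
print(detect_language("de hond zat", vocabularies))
```

With toy corpora this small the result only shows the mechanics; real vocabularies would need far more text per language.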

And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style.

EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
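For contrast with word trigrams, here is a rough sketch of what counting character trigrams could look like. The function name `char_trigrams` and the choice to pad the text with a single space on each side are assumptions for illustration, not necessarily what the original project does.

```python
from collections import Counter


def char_trigrams(text):
    # Normalize whitespace, lowercase, and pad with one space on each side
    # so that word boundaries at the edges also produce trigrams (an assumption).
    padded = " " + " ".join(text.lower().split()) + " "
    # Slide a window of three characters over the padded string.
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


counts = char_trigrams("the the")
print(counts.most_common(3))
```

Even a short sentence yields many character trigrams, which is why they need much less text than word trigrams to give usable counts.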


wooorm 4273 days ago

Thanks ;)


andreasvc 4273 days ago

BTW, here is my implementation of this idea: https://github.com/andreasvc/disco-dop/blob/master/web/parse...

I haven't tested it on more than 3 languages, so it might perform badly, but my intuition is that it is easier to get good coverage of a language's vocabulary than to get the frequencies of something like the top character n-grams right. The latter are affected by the authorship and genre of the text, etc.
