by wooorm 4271 days ago
I’ll investigate this, but I think I excluded the preambles from trigram creation. Sure, the words will be a bit similar, but it’ll be a lot of work to compile 380 fixtures from other sources.

I’ll investigate that too. But it’s a lot of work, as this already was, so give me some time :)

1 comments

I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection, a simple unigram model would be sufficient; i.e., just choose the language with the maximum recall of word tokens from its corpus with respect to the input. I think this would work better on short sentences as well.

And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style.

EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
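
For illustration, the unigram idea above could be sketched roughly like this in Python. This is only one reading of "maximum recall of word tokens": score each language by the fraction of input tokens that occur in its vocabulary. The function names and the tiny sample corpora are made up for the example; a real vocabulary would need far more text.

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_vocab(corpus_text):
    # The "model" for a language is just the set of word types seen in its corpus.
    return set(tokenize(corpus_text))

def detect_by_words(text, vocabs):
    # Pick the language whose vocabulary covers the largest share of input tokens.
    tokens = tokenize(text)
    if not tokens:
        return None

    def coverage(lang):
        return sum(t in vocabs[lang] for t in tokens) / len(tokens)

    return max(vocabs, key=coverage)

# Toy corpora (assumed for this sketch only).
word_vocabs = {
    "en": build_vocab("all human beings are born free and equal in dignity and rights"),
    "nl": build_vocab("alle mensen worden vrij en gelijk in waardigheid en rechten geboren"),
}

print(detect_by_words("the rights of human beings", word_vocabs))
print(detect_by_words("mensen worden vrij", word_vocabs))
```

Ties are broken by dictionary order here, which a real implementation would probably want to handle more carefully.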

Thanks ;)
BTW, here is my implementation of this idea: https://github.com/andreasvc/disco-dop/blob/master/web/parse...

I haven't tested it on more than 3 languages, so it might perform badly, but I have the intuition that it is easier to get good coverage of the vocabulary of languages than to get the frequencies of something like the top character n-grams right. The latter is affected by authorship and genre of text, etc.
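
For comparison, a character trigram approach could look roughly like the sketch below. This is not the code of the library under discussion or of the implementation linked above; it is a generic illustration that counts overlapping three-character windows and compares frequency profiles by cosine similarity. The names and sample texts are assumptions for the example.

```python
from collections import Counter
import math

def trigrams(text):
    # Normalize whitespace, pad with spaces, and count overlapping 3-character windows.
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    # Cosine similarity between two trigram count vectors.
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect_by_trigrams(text, profiles):
    # Pick the language whose trigram profile is most similar to the input's.
    query = trigrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))

# Toy profiles (assumed for this sketch only).
trigram_profiles = {
    "en": trigrams("all human beings are born free and equal in dignity and rights"),
    "nl": trigrams("alle mensen worden vrij en gelijk in waardigheid en rechten geboren"),
}

print(detect_by_trigrams("human rights", trigram_profiles))
print(detect_by_trigrams("gelijk in rechten", trigram_profiles))
```

Because the profile is built from raw frequencies, it inherits whatever quirks the training text has, which is the authorship and genre sensitivity mentioned above.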