by andreasvc
4273 days ago
I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection a simple unigram model would be sufficient; i.e., just choose the language whose corpus gives the maximum recall of the word tokens in the input. I think this would work better on short sentences as well. And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style. EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
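To make the unigram idea concrete, here is a minimal Python sketch of "pick the language with the highest token recall". It is not the article's implementation: the function names (`tokens`, `char_trigrams`, `recall`, `detect`) and the two one-sentence toy corpora are assumptions made up for illustration. Swapping the feature function gives a character-trigram variant of the same scoring.

```python
import re

def tokens(text):
    # Lowercased word tokens; \w+ is Unicode-aware for str patterns in Python 3.
    return re.findall(r"\w+", text.lower())

def char_trigrams(text):
    # Character trigrams over the normalized text, including word-boundary spaces.
    s = " ".join(tokens(text))
    return [s[i:i + 3] for i in range(len(s) - 2)]

def recall(items, vocabulary):
    # Fraction of the input's items that also occur in the language's corpus.
    if not items:
        return 0.0
    return sum(1 for x in items if x in vocabulary) / len(items)

def detect(text, corpora, featurize=tokens):
    # Build one vocabulary (set of features) per language, then return the
    # language whose vocabulary covers the largest share of the input.
    vocabularies = {lang: set(featurize(corpus)) for lang, corpus in corpora.items()}
    feats = featurize(text)
    return max(vocabularies, key=lambda lang: recall(feats, vocabularies[lang]))

# Hypothetical toy corpora; a real test would use a much larger text per language.
CORPORA = {
    "english": "all human beings are born free and equal in dignity and rights",
    "german": "alle menschen sind frei und gleich an würde und rechten geboren",
}

if __name__ == "__main__":
    print(detect("the human beings are free", CORPORA))
    print(detect("frei und gleich", CORPORA, featurize=char_trigrams))
```

With corpora this small the word-level version fails as soon as the input uses words outside the corpus, which is where the character-trigram variant should hold up better.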