Hacker News
by jules 4275 days ago
Naive Bayes is just a couple of lines of code (less than what you have now, for sure). The trained model would not be bigger than what you currently have either.
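
To make the "couple of lines" claim concrete, here is a minimal sketch of a naive Bayes language classifier over character trigrams. It is only an illustration under stated assumptions, not Franc's code: the names `trigrams`, `train`, `classify` and the toy `SAMPLES` texts are made up for this example, the prior over languages is assumed uniform, and add-one smoothing is one choice among several.

```python
import math
from collections import Counter

# Hypothetical toy training data; a real model would use far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}


def trigrams(text):
    """Return the character trigrams of text, padded with one space per side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count trigrams per language; also return the shared vocabulary size."""
    counts = {lang: Counter(trigrams(text)) for lang, text in samples.items()}
    vocab = set().union(*counts.values())
    return counts, len(vocab)


def classify(model, text):
    """Pick the language with the highest smoothed log-likelihood."""
    counts, vocab_size = model
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        # Sum of log P(trigram | lang) with add-one smoothing.
        # The prior P(lang) is assumed uniform, so it is left out.
        score = sum(
            math.log((lang_counts[gram] + 1) / (total + vocab_size))
            for gram in trigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


model = train(SAMPLES)
print(classify(model, "the dog"))
print(classify(model, "de hond"))
```

The trained model here is nothing more than one trigram count table per language, which is the same kind of data a trigram-profile approach already stores.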

Although I don't need a test suite to confidently say that L1 distance is going to be worse than naive Bayes, it would indeed be good if you had a test suite!

We already have a test suite with 1 example: https://news.ycombinator.com/item?id=8405180

Naive Bayes would never get something like that wrong.


Franc seems to work well on longer passages, such as these: https://github.com/wooorm/franc/blob/master/spec/fixtures.js...

It’s interesting though, I’ll take a look at it!

Firstly, that looks like text from the UDHR. Any method will do well on the text that it was trained on, so that will not be representative of real-world performance. Secondly, any method will do well on longer passages. If you want to do a more realistic test, you should pick sentences from an independent source (e.g. Wikipedia).
I’ll investigate this, but I think I excluded the preambles for trigram creation. Sure, the words will be a bit similar, but it’ll be a lot of work to compile 380 fixtures from other sources.

I’ll investigate that too. But it’s a lot of work, as this already was; give me some time :)

I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection, a simple unigram model would be sufficient; i.e., just choose the language with the maximum recall of word tokens from its corpus with respect to the input. I think this would work better on short sentences as well.

And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style.

EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
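
The word-recall idea from this comment can be sketched roughly as follows: keep a set of known words per language and pick the language whose set covers the largest share of the input tokens. This is a guess at the simplest form of the idea, not the linked implementation; `CORPORA`, `tokens`, `build_vocabularies` and `detect` are hypothetical names, and the toy corpora are far too small for real use.

```python
import re

# Hypothetical toy corpora; real vocabularies would come from large text collections.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabularies(corpora):
    """Map each language to the set of word types seen in its corpus."""
    return {lang: set(tokens(text)) for lang, text in corpora.items()}


def detect(vocabularies, text):
    """Return the language whose vocabulary covers the most input tokens."""
    words = tokens(text)
    if not words:
        return None  # nothing to base a decision on

    def recall(lang):
        vocab = vocabularies[lang]
        return sum(word in vocab for word in words) / len(words)

    # On a tie, max() keeps the first language in dictionary order.
    return max(vocabularies, key=recall)


vocabularies = build_vocabularies(CORPORA)
print(detect(vocabularies, "the lazy dog"))
print(detect(vocabularies, "de luie hond"))
```

Only set membership is used, so word frequencies play no role; that matches the intuition that vocabulary coverage is less sensitive to genre than n-gram frequencies, at the cost of needing a large word list per language.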

Thanks ;)
BTW, here is my implementation of this idea: https://github.com/andreasvc/disco-dop/blob/master/web/parse...

I haven't tested it on more than 3 languages, so it might perform badly, but my intuition is that it is easier to get good coverage of the vocabulary of languages than to get the frequencies of something like the top character n-grams right. The latter is affected by authorship and genre of text &c.