by wooorm 4270 days ago
One of franc’s goals was to be fairly small and usable on the client side; that’s why no actual training is done and this simple method is used.

Also, I’m interested in a test suite before we start talking about accuracy percentages :p

Naive Bayes is just a couple of lines of code (fewer than what you have now, for sure). The trained model is no bigger than what you currently have, either.
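
To make the "couple of lines" claim concrete, here is a minimal sketch of a naive Bayes language guesser. It assumes character trigrams as features, add-one smoothing, and a uniform prior over languages; none of these choices are specified in the thread. The names `trigrams`, `train`, `classify`, and the toy sentences are hypothetical and not part of franc.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the overlapping character trigrams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Map each language code to the trigram counts of its training text."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

def classify(model, text):
    """Pick the language with the highest add-one-smoothed log-likelihood (uniform prior)."""
    vocab = set().union(*model.values())
    best_lang, best_score = None, -math.inf
    for lang, counts in model.items():
        # The extra +1 reserves some probability mass for trigrams never seen in training.
        total = sum(counts.values()) + len(vocab) + 1
        score = sum(math.log((counts[g] + 1) / total) for g in trigrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy training data, invented purely for illustration (assumption, not a real corpus).
TOY_SAMPLES = {
    "eng": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "nld": "de snelle bruine vos springt over de luie hond en dan slaapt de hond",
}
model = train(TOY_SAMPLES)
print(classify(model, "the dog"))
print(classify(model, "de hond"))
```

The "trained model" here is just one trigram-count table per language, which is the same kind of data a trigram-profile approach would ship anyway.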

Although I don't need a test suite to confidently say that L1 distance is going to be worse than naive Bayes, it would indeed be good if you had a test suite!
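
For comparison, the L1-distance approach referred to here can be sketched roughly as follows. This assumes each language is represented by a normalized character-trigram frequency profile and that the input is assigned to the nearest profile; franc's actual distance measure may differ, and `profile`, `l1_distance`, `nearest_language`, and the toy sentences are hypothetical names and data used only for illustration.

```python
from collections import Counter

def profile(text):
    """Return relative frequencies of character trigrams in a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def l1_distance(p, q):
    """Sum of absolute frequency differences over every trigram seen in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())

def nearest_language(profiles, text):
    """Return the language whose profile has the smallest L1 distance to the input's profile."""
    target = profile(text)
    return min(profiles, key=lambda lang: l1_distance(profiles[lang], target))

# Toy profiles built from invented sentences (assumption, not a real corpus).
TOY_PROFILES = {
    "eng": profile("the quick brown fox jumps over the lazy dog and then the dog sleeps"),
    "nld": profile("de snelle bruine vos springt over de luie hond en dan slaapt de hond"),
}
print(nearest_language(TOY_PROFILES, "the dog"))
```

Note that a short input has few trigrams, so most of its distance to any profile comes from trigrams it simply does not contain, which is one plausible reason this kind of measure struggles on short sentences.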

We already have a test suite with 1 example: https://news.ycombinator.com/item?id=8405180

Naive Bayes would never get something like that wrong.

Franc seems to work well on longer passages, such as these: https://github.com/wooorm/franc/blob/master/spec/fixtures.js...

It’s interesting though, I’ll take a look at it!

Firstly, that looks like text from the UDHR. Any method will do well on the text that it was trained on, so that will not be representative of real-world performance. Secondly, any method will do well on longer passages. If you want to do a more realistic test, you should pick sentences from an independent source (e.g. Wikipedia).

I’ll investigate this, but I think I excluded the preambles from trigram creation. Sure, the words will be a bit similar, but it’ll be a lot of work to compile 380 fixtures from other sources.

I’ll investigate that too. But it’s a lot of work, as this already was; give me some time :)

I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection, a simple unigram model would be sufficient; i.e., just choose the language with the maximum recall of word tokens from its corpus with respect to the input. I think this would work better on short sentences as well.

And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style.

EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
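
For reference, the word-unigram idea described above (choose the language whose corpus covers the largest share of the input's word tokens) could be sketched like this. It assumes whitespace tokenization and a plain set of known words per language; `word_set`, `detect_by_word_recall`, and the toy vocabularies are hypothetical and not taken from any existing library.

```python
def word_set(text):
    """Return the set of lower-cased, whitespace-separated tokens in a corpus."""
    return set(text.lower().split())

def detect_by_word_recall(vocabularies, text):
    """Return the language whose vocabulary contains the largest fraction of input tokens."""
    tokens = text.lower().split()
    if not tokens:
        return None  # nothing to score

    def recall(lang):
        return sum(token in vocabularies[lang] for token in tokens) / len(tokens)

    # On a tie, max() keeps the first language in insertion order.
    return max(vocabularies, key=recall)

# Toy vocabularies built from invented sentences (assumption, not a real corpus).
TOY_VOCABULARIES = {
    "eng": word_set("the quick brown fox jumps over the lazy dog"),
    "nld": word_set("de snelle bruine vos springt over de luie hond"),
}
print(detect_by_word_recall(TOY_VOCABULARIES, "the lazy dog"))
```

A real version would need a much larger word list per language, since any input word missing from the corpus contributes nothing to the score.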

Thanks ;)