

jules 4274 days ago

Seems like this just compares the L1 distance between the trigram count vector of the input and that of some preselected document in each language. That won't be very accurate. A much better way to go here is naive Bayes. There are more sophisticated approaches, but naive Bayes will already get you much further than this. If you train it on Wikipedia articles for the most popular languages, you would most likely get >99% accuracy.
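
For readers who want to see what such a comparison looks like, here is a minimal sketch of an L1 distance between character-trigram vectors. It is not franc's actual code. The function names, the normalisation to relative frequencies, and the toy reference documents are all assumptions made for illustration.

```python
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in lowercased, space-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def l1_distance(a, b):
    """Sum of absolute differences between two relative-frequency vectors."""
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    return sum(abs(a[g] / total_a - b[g] / total_b) for g in set(a) | set(b))


def detect_l1(text, references):
    """Return the language whose reference profile is closest to the input."""
    query = trigram_counts(text)
    return min(references, key=lambda lang: l1_distance(query, references[lang]))


# Hypothetical toy reference documents; a real profile needs far more text.
REFERENCE_DOCS = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
REFERENCES = {lang: trigram_counts(doc) for lang, doc in REFERENCE_DOCS.items()}

print(detect_l1("the dog sat on the mat", REFERENCES))
```

With a single reference document per language, the result depends heavily on which trigrams happen to occur in that one document, which is the weakness the comment points at.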

2 comments

breuderink 4271 days ago

One method that I have used in the past was über-simple, yet extremely effective. It exploits ZIP compression, based on the insight/assumption that two concatenated texts compress better when they share a language.

I think I found it in this paper [1]. The implementation was like 13 lines of Python code. I wonder how it would compare.

[1] http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...
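
A rough sketch of the compression idea, assuming `zlib` as the compressor and one small reference corpus per language. This is not the original 13-line implementation, which is not shown in the thread. The names `compressed_size`, `detect_by_compression`, and `CORPORA` are made up for illustration.

```python
import zlib


def compressed_size(text):
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def detect_by_compression(text, corpora):
    """Pick the language whose corpus grows least when the text is appended."""
    def extra_bytes(lang):
        corpus = corpora[lang]
        return compressed_size(corpus + " " + text) - compressed_size(corpus)

    return min(corpora, key=extra_bytes)


# Hypothetical toy corpora; real use would need much more text per language.
CORPORA = {
    "en": ("the quick brown fox jumps over the lazy dog and the cat sat on "
           "the mat while the children play in the garden"),
    "nl": ("de snelle bruine vos springt over de luie hond en de kat zat op "
           "de mat terwijl de kinderen in de tuin spelen"),
}

print(detect_by_compression("the cat sat on the mat while the children play", CORPORA))
```

The score is a whole number of bytes, so two languages can easily tie on short inputs. Note also that zlib only looks back 32 KB, so a very large corpus would not be fully used.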


wooorm 4271 days ago

It’s a very interesting idea. Would it be accurate enough when scaled to 160+ languages?


breuderink 4271 days ago

I don't know; I think I used about 40 languages. The beauty is that ZIP compression captures rich statistical properties of the languages, so representation-wise it should go a long way. But counting compressed output length discretises the language-to-language distance. For shorter texts this might be a problem, since it could easily result in ties. So, maybe. Perhaps I should try :).


wooorm 4271 days ago

Perhaps you should ;) If you do, I’d be interested to know how it goes!


wooorm 4273 days ago

One of franc’s goals was to be pretty small and usable on the client side; that’s why no actual training is done and this simple method is used.

Also, I’m interested in a test suite before we start talking about accuracy percentages :p


jules 4273 days ago

Naive Bayes is just a couple of lines of code (fewer than what you have now, for sure). The trained model is no bigger than what you currently have either.

Although I don't need a test suite to confidently say that L1 distance is going to be worse than naive Bayes, it would indeed be good if you had a test suite!

We already have a test suite with 1 example: https://news.ycombinator.com/item?id=8405180

Naive Bayes would never get something like that wrong.
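
To give an idea of the size involved, here is a minimal naive Bayes sketch over character trigrams. It assumes a uniform prior over languages and add-one (Laplace) smoothing, neither of which is specified in the thread. The training texts are hypothetical toy data, and `train` and `classify` are illustrative names.

```python
import math
from collections import Counter


def char_trigrams(text):
    """List the overlapping character trigrams of lowercased, space-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Map each language to the trigram counts of its training text."""
    return {lang: Counter(char_trigrams(text)) for lang, text in samples.items()}


def classify(text, model, alpha=1.0):
    """Return the language with the highest smoothed log-likelihood (uniform prior)."""
    # One extra vocabulary slot is reserved for trigrams never seen in training.
    vocab_size = len(set().union(*model.values())) + 1
    grams = char_trigrams(text)

    def log_likelihood(lang):
        counts = model[lang]
        total = sum(counts.values()) + alpha * vocab_size
        return sum(math.log((counts[g] + alpha) / total) for g in grams)

    return max(model, key=log_likelihood)


# Hypothetical toy training data; the thread suggests Wikipedia articles instead.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
MODEL = train(TRAINING)

print(classify("the dog sat on the mat", MODEL))
```

The trained model here is just one trigram count table per language, which is the same kind of data a trigram-profile approach already stores.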


wooorm 4273 days ago

Franc seems to work well on longer passages, such as these: https://github.com/wooorm/franc/blob/master/spec/fixtures.js...

It’s interesting though, I’ll take a look at it!


jules 4273 days ago

Firstly, that looks like text from the UDHR. Any method will do well on the text that it was trained on, so that will not be representative of real-world performance. Secondly, any method will do well on longer passages. If you want to do a more realistic test, you should pick sentences from an independent source (e.g. Wikipedia).
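
A held-out test of the kind described above could be as small as the sketch below. The labelled sentences are hypothetical, and `toy_detect` is only a stand-in for whatever detector is being evaluated. The key assumption is that none of the sentences come from the training text.

```python
def accuracy(detect, labelled):
    """Fraction of (text, language) pairs for which detect(text) returns the label."""
    correct = sum(detect(text) == lang for text, lang in labelled)
    return correct / len(labelled)


def toy_detect(text):
    """Stand-in detector for illustration only: 'en' if the word 'the' occurs."""
    return "en" if "the" in text.lower().split() else "nl"


# Hypothetical short sentences, assumed to come from a source independent of
# the training text (the comment suggests Wikipedia).
HELD_OUT = [
    ("the weather is nice today", "en"),
    ("het weer is vandaag mooi", "nl"),
    ("where is the station", "en"),
    ("waar is het station", "nl"),
]

print(accuracy(toy_detect, HELD_OUT))
```

Reporting this number separately for short and long inputs would show whether a method only does well on longer passages.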


wooorm 4273 days ago

I’ll investigate this, but I think I excluded the preambles from trigram creation. Sure, the words will be somewhat similar, but it’ll be a lot of work to compile 380 fixtures from other sources.

I’ll investigate that too. But it’s a lot of work, as this already was, so give me some time :)


andreasvc 4273 days ago

I wonder why you chose to use trigrams. Trigrams require a lot more data to get representative statistics, and the UDHR isn't big enough for that (you'd want a million words). Furthermore, I would think that for the task of language detection, a simple unigram model would be sufficient; i.e., just choose the language with the maximum recall of word tokens from its corpus with respect to the input. I think this would work better on short sentences as well.

And excluding the preamble doesn't make your test more meaningful; the text is still in a very particular domain and writing style.

EDIT: I had assumed trigrams meant word trigrams; character trigrams are a good choice for this.
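
For comparison, the word-unigram idea from the first paragraph can be sketched as below: score each language by the fraction of input tokens found in its vocabulary. Splitting on whitespace and the toy vocabularies are assumptions for illustration.

```python
def word_recall(text, vocabulary):
    """Fraction of whitespace-separated input tokens present in the vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def detect_by_words(text, vocabularies):
    """Return the language whose vocabulary covers the most input tokens."""
    return max(vocabularies, key=lambda lang: word_recall(text, vocabularies[lang]))


# Hypothetical toy corpora; a usable vocabulary needs a much larger corpus.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
VOCABULARIES = {lang: set(doc.lower().split()) for lang, doc in CORPORA.items()}

print(detect_by_words("the cat and the dog", VOCABULARIES))
```

Whitespace tokenisation does not work for languages written without spaces, which is one reason character n-grams are a common choice for this task.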
