Seems like this just computes the L1 distance between the input's trigram count vector and that of some preselected document in each language. That won't be very accurate. A much better way to go here is naive Bayes. There are more sophisticated approaches, but naive Bayes will already get you much further than this. If you train it on Wikipedia articles for the most popular languages you would most likely get >99% accuracy.
One method that I have used in the past was über-simple, yet extremely effective. It exploits ZIP compression, based on the insight/assumption that two concatenated texts compress better when they share their language.
I think I found it in this paper [1]. The implementation was like 13 lines of Python code. I wonder how it would compare.
I don't know, I think I used about 40 languages. The beauty is that zip-compression captures rich statistical properties of the languages, so representation-wise it should go a long way. But counting compressed output length discretises the lang-lang distance. For shorter texts this might be a problem, since it could easily result in ties. So, maybe. Perhaps I should try :).
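For the curious, here is roughly what I mean, as a Node sketch rather than the code from the paper. The function names and the `references` object (language code to a sample text) are made up for illustration; I'm using raw DEFLATE from Node's built-in `zlib`, which is the algorithm behind ZIP.

```javascript
const zlib = require('zlib');

// Compressed size in bytes of a string, using raw DEFLATE.
function compressedSize(text) {
  return zlib.deflateRawSync(Buffer.from(text, 'utf8'), { level: 9 }).length;
}

// Pick the language whose reference text "absorbs" the sample most cheaply:
// cost = size(reference + sample) - size(reference).
// `references` is a hypothetical { languageCode: referenceText } object.
function detectByCompression(sample, references) {
  let best = null;
  let bestCost = Infinity;
  for (const [language, reference] of Object.entries(references)) {
    const cost =
      compressedSize(reference + ' ' + sample) - compressedSize(reference);
    // Strict comparison: on a tie the first language wins, which is
    // exactly the tie problem mentioned above for short inputs.
    if (cost < bestCost) {
      bestCost = cost;
      best = language;
    }
  }
  return best;
}
```

One caveat: DEFLATE only looks back 32 KiB, so reference texts longer than that stop helping, because the sample can no longer refer back to the start of the reference.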
Naive Bayes is just a couple of lines of code (fewer than what you have now, for sure). The trained model is no bigger than what you currently have either.
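To make that concrete, here is a rough sketch of a trigram naive Bayes classifier. It assumes a uniform prior over languages and add-one smoothing. The names (`train`, `classify`, `corpus`) are mine, not anything from franc, and a real model would be trained on far more text than a toy object literal.

```javascript
// Split text into overlapping character trigrams (lowercased, space-padded).
function trigrams(text) {
  const padded = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const result = [];
  for (let i = 0; i + 3 <= padded.length; i++) {
    result.push(padded.slice(i, i + 3));
  }
  return result;
}

// Count trigrams per language. `corpus` is a hypothetical
// { languageCode: trainingText } object.
function train(corpus) {
  const model = {};
  const vocabulary = new Set();
  for (const [language, text] of Object.entries(corpus)) {
    const counts = new Map();
    let total = 0;
    for (const gram of trigrams(text)) {
      counts.set(gram, (counts.get(gram) || 0) + 1);
      vocabulary.add(gram);
      total++;
    }
    model[language] = { counts, total };
  }
  return { model, vocabularySize: vocabulary.size };
}

// Score each language by the sum of log P(trigram | language),
// with add-one smoothing and a uniform prior, and return the best one.
function classify(text, { model, vocabularySize }) {
  const grams = trigrams(text);
  let best = null;
  let bestScore = -Infinity;
  for (const [language, { counts, total }] of Object.entries(model)) {
    let score = 0;
    for (const gram of grams) {
      score += Math.log(((counts.get(gram) || 0) + 1) / (total + vocabularySize));
    }
    if (score > bestScore) {
      bestScore = score;
      best = language;
    }
  }
  return best;
}
```

Usage is `classify('some sentence', train({ eng: '...', nld: '...' }))`. Unlike a plain distance on counts, an unseen trigram costs a language a lot of log-probability, which is most of why it does better on short inputs.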
Although I don't need a test suite to confidently say that L1 distance is going to be worse than naive bayes, it would indeed be good if you had a test suite!
Firstly, that looks like text from the UDHR. Any method will do well on the text that it was trained on, so that will not be representative of real world performance. Secondly, any method will do well on longer passages. If you want to do a more real world test you should pick sentences from an independent source (e.g. wikipedia).
I’ll investigate this, but I think I excluded the preambles for trigram creation. Sure, the words will be a bit similar, but it’ll be a lot of work to compile 380 fixtures from other sources.
I’ll investigate that too. But it’s lots of work, this already was, give me some time :)
The project does not seem to state clearly how the detection is done: does it call an external webservice, or does it rely on an offline database created at some point?
shameless plug https://github.com/allan-simon/Tatodetect
it covers 179 languages (as many as the Tatoeba project does) and it can run offline, with an explanation of how to generate your own database from a CC-BY corpus.
That said, the advantage of Franc is that it can be used directly as an npm library, while Tatodetect is a micro-webservice, and for some edge languages Tatodetect is certainly not as good as Franc (I haven't yet run a test of both to compare).
You are completely right, franc doesn’t state how languages are detected. The detection is based on (1) unicode-script usage and (2) trigram-counts. Some scripts are only used by one language. Other scripts, such as Cyrillic, are shared by many languages: those are told apart by the top 300 trigrams of each language's UDHR translation (Universal Declaration of Human Rights, the most translated document).
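For readers who want a feel for step (2), here is a simplified sketch of comparing ranked top-trigram lists. It is not franc's actual source. The distance used is the classic "out-of-place" rank difference, which I'm assuming here for illustration, and the names `topTrigrams`, `distance` and `rankLanguages` are hypothetical.

```javascript
const MAX_TRIGRAMS = 300; // the "top 300" mentioned above

// Count overlapping character trigrams in lowercased, space-padded text.
function trigramCounts(text) {
  const padded = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i + 3 <= padded.length; i++) {
    const gram = padded.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Most frequent trigrams first; ties are broken alphabetically
// so that the ranking is deterministic.
function topTrigrams(text, limit = MAX_TRIGRAMS) {
  return [...trigramCounts(text).entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, limit)
    .map(([gram]) => gram);
}

// Sum of rank differences; a trigram missing from the profile
// gets the maximum penalty.
function distance(sampleRanking, profileRanking) {
  const rankInProfile = new Map(profileRanking.map((gram, rank) => [gram, rank]));
  let total = 0;
  sampleRanking.forEach((gram, rank) => {
    total += rankInProfile.has(gram)
      ? Math.abs(rank - rankInProfile.get(gram))
      : MAX_TRIGRAMS;
  });
  return total;
}

// Returns [language, distance] pairs, closest language first.
// `profiles` is a hypothetical { languageCode: rankedTrigramList } object.
function rankLanguages(sample, profiles) {
  const sampleRanking = topTrigrams(sample);
  return Object.entries(profiles)
    .map(([language, ranking]) => [language, distance(sampleRanking, ranking)])
    .sort((a, b) => a[1] - b[1]);
}
```

Only the ranked list per language needs to be stored, not the counts themselves, which keeps the data file small.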
Shameless plug, the page clearly stated you can fork franc to support 300+ languages ;)
Thanks for the information, and don't get me wrong, I'm not trying to play at who has the biggest. It's just that at first, without knowing how the data file was generated, I read "you can fork to support 300+ languages" as a statement like "well, provided you find a way to supply the data file for that many languages, which we didn't because doing it is hard/requires a huge corpus". But if it's just parsing more UDHR translations, then sure, it can easily be forked to reach 300+ languages :)
Yeah, so I’d like to add an easier way to support more, or fewer, languages through the Node API. Currently, there’s a number (1e6), the minimum number of speakers of a given language, which is hard-coded in the generation file (this morning I added a link to the actual line in the statement about forking).
If you set that number to 0, or 100,000 and execute `npm prepublish`, your franc supports more languages :) That’s it!
the question would be where he got the language data
If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.
No full-frequency data is kept, only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!
Sometimes it gets it almost right: I tried with this piece of text in Catalan (Balear variant) and it classifies it as Portuguese (with Catalan as 2nd option): "I s'horabaixa la deixam passar i me mires tan a prop que me fa mal, que surt es sol i encara plou, que t'estim massa i massa poc, que no sé com ho hem d'arreglar, que som amics, que som amants."
It's strange, because it's pretty different from Portuguese...
It sucks, right? Currently, it’s good at long passages. But for shorter inputs, the results are pretty poor. The number of supported languages is just too damn high!
The 60% threshold for the single-language scripts seems to be way low for CJK languages. And your method to calculate the occurrence ratio is flawed.
CJK scripts and languages tend to be relatively more concise (in terms of # of Unicode codepoints) than many other languages, so in mixed text the ratio of CJK over non-CJK characters can be lower than you would expect on average. And the occurrence ratio is currently calculated over the number of characters including non-letters, which makes the ratio much lower. Maybe a custom threshold per script based on an actual corpus (90th percentile, maybe?) and a better occurrence calculation would improve the detection for those languages.
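To illustrate the second point, a small sketch of the difference between counting over all codepoints and counting over letters only. This is not franc's code; `scriptRatio` and `naiveRatio` are names I made up, and it relies on Unicode property escapes in regular expressions (available in recent Node versions).

```javascript
const HANGUL = /\p{Script=Hangul}/u;

// Share of codepoints in the given script, counted over ALL codepoints
// (digits, punctuation and spaces included).
function naiveRatio(text, scriptPattern) {
  const all = Array.from(text);
  if (all.length === 0) return 0;
  return all.filter((ch) => scriptPattern.test(ch)).length / all.length;
}

// Same share, but counted over letters only, so digits,
// punctuation and whitespace no longer dilute the ratio.
function scriptRatio(text, scriptPattern) {
  const letters = text.match(/\p{L}/gu) || [];
  if (letters.length === 0) return 0;
  return letters.filter((ch) => scriptPattern.test(ch)).length / letters.length;
}
```

For a string like `'한국어 4.1% (2004)'` every letter is Hangul, yet only 3 of its 15 codepoints are, so the first function falls far below a 60% threshold while the second does not.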
I’m not sure. I don’t know any CJK languages myself. I’d like some test-cases where the current methods do not work, as the example in the Readme seems to work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?
Some examples follow. I've really tested with arbitrary text on the Web, and I agree that they are somewhat marginal examples. (But I do think that Franc's margin for CJK languages is way wide.)
한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%), 스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전 세계 69억여 명의 인구 중 약 1%에 해당한다.
This text from Korean Wikipedia is about the share of Korean documents among all documents on the Internet. Digits distort the overall ratio and Franc doesn't give any candidates (not even "und").
現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book" はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library" は前置詞とともに付け加えられた修飾語と考えられる。
This text from Japanese Wikipedia is about the distinction between objects and complements in English syntax. In this bilingual text Japanese looks like it reaches the 60% threshold, but going by the codepoint count it doesn't.
var franc = require('franc');
console.log('ron?', franc('Cate bere ai baut?'));
console.log('fra?', franc('C\'est quoi le bordel la, putain'));
console.log('swe?', franc('Jag kanner en bot, hon heter Anna'));
console.log('ita?', franc('che guai'));
console.log('nld?', franc('graag gedaan'));
It is not absurd. Generally, if humans can do it, it is a reasonable task for NLP to attempt.
Yes you can present edge cases where there is no definite answer, like the one you cite, but this doesn't mean that the task in general is impossible or useless.
I agree the task is neither impossible nor useless. There’s work to do. Short passages should be supported. I do however think franc does a good job, and adds support for some languages which before today have never (I think) been supported. Franc, certainly, “attempt”s to fix language detection, which I would argue is an AI-complete problem.
Anyway, you’re completely right. Italian is `und` because the input is 10 characters or fewer, and the others are slightly off due to short input too, but the demo (http://wooorm.github.io/franc/) does show the correct languages in second or third place!
No it doesn't, still takes French for Catalan (French only comes at third place, after Italian), and Swedish for Dutch.
(Arguably those are close languages, but hey, this is why I'm using this, right?)
By `correct language` I mean the language you expect, by `second` and `third` I mean `2.` and `3.` in the previously mentioned demo (http://wooorm.github.io/franc/). I think we’re talking about the same thing!
Anyway, yeah, franc is for language detection, but it’s optimised for many languages and works best on longer text. It’s a trade-off. For fewer languages and shorter texts, check out https://github.com/shuyo/ldig
+1 indeed, but I think most people already have a hard time seeing why we need to distinguish between country codes and language codes, and even more so that something people consider a "dialect" can actually be a totally different language (for example, in China a lot of "dialects/fangyan" are actually not dialects of Mandarin, such as Shanghainese (Wu language) and the languages of Hunan province)
then you can also try to explain to them that the common "represent a language by a flag" approach quickly breaks down and leads to heated arguments (what flag do you use for the Tibetan language, for example? or for each of the Indian languages?)
It would be interesting to see comparisons with language detection libraries written in other languages as well. Not just in terms of runtime, but also accuracy. Actually, it seems like this would be useful as a separate project to help the decision-making process when choosing a library.
Thanks for the pointer. I might decide to whip something up one of these days. I really have no need for language detection, but I just find it interesting and I'm curious to see which libraries will win out.
"»Butter and cheese« is proper English and proper Fries."
Unfortunately Fries is not supported, but I'd be interested in the results. But I don't think polyglots for natural languages are common, this is in fact the only one I know.
It does have several translations of the bible, though. I guess it would be a lot of work to find bible translations for all those languages - or was there another reason for using the Human Rights Declaration?
Thanks! Currently, the UDHRs are crawled, and I’d rather not include exceptions and maintain their plain-text and XML/JSON versions by hand. If you’re into growing the language, I suggest contacting the Office of the High Commissioner of Human Rights of the UN, and the Unicode project, or fork wooorm/udhr and add support, I’ll merge :)
That’s because Haitians always say that! No, joking, it’s just that because of so many supported languages, the accuracy for very short inputs is extremely low.
for that, the way I've done it in TatoDetect (which is meant specifically for detecting the language of one sentence at a time) is to have an N-gram database per language that is big enough to be nearly sure it has "them all". Then, if the text to detect contains an N-gram that a language does not have in its database, you can apply a score penalty to that language.
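A toy sketch of that idea, not Tatodetect's actual scoring: each language has a set of known trigrams, a known trigram adds a point and an unknown one subtracts a penalty. The function name, the `knownTrigrams` shape and the penalty value of 2 are all made up for illustration.

```javascript
// Split text into overlapping character trigrams (lowercased, space-padded).
function trigrams(text) {
  const padded = ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
  const result = [];
  for (let i = 0; i + 3 <= padded.length; i++) {
    result.push(padded.slice(i, i + 3));
  }
  return result;
}

// `knownTrigrams` is a hypothetical { languageCode: Set of trigrams } object,
// assumed to be (nearly) exhaustive for each language.
function scoreWithPenalty(sentence, knownTrigrams, penalty = 2) {
  const grams = trigrams(sentence);
  const scores = {};
  for (const [language, known] of Object.entries(knownTrigrams)) {
    let score = 0;
    for (const gram of grams) {
      // +1 for a trigram the language has, -penalty for one it lacks.
      score += known.has(gram) ? 1 : -penalty;
    }
    scores[language] = score;
  }
  return scores;
}
```

The penalty only makes sense when the database really is close to exhaustive; with a top-300 list, most perfectly valid trigrams would be punished too.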