Show HN: Franc – Detect natural languages | HN Mirror

	Show HN: Franc – Detect natural languages (github.com)
	51 points by wooorm 4274 days ago

13 comments

jules 4273 days ago

Seems like this just compares the L1 distance of the trigram count vector to some preselected document in each language. That won't be very accurate. A much better way to go here is naive bayes. There are more sophisticated approaches but naive bayes will get you much further than this already. If you train this with wikipedia articles for the most popular languages you would most likely get >99% accuracy.
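For illustration, here is a minimal sketch of what a character-trigram naive Bayes classifier could look like. It assumes a uniform prior over languages and add-one smoothing; the function names and the two toy training strings are invented for this example and are not taken from franc.

```javascript
// Sketch: multinomial naive Bayes over character trigrams (add-one smoothing).
// The training strings are toy data, far too small for real use.
function trigrams(text) {
  // Lowercase, collapse every run of non-letters to one space, pad with spaces.
  const clean = ' ' + text.toLowerCase().replace(/[^\p{L}]+/gu, ' ').trim() + ' ';
  const out = [];
  for (let i = 0; i + 3 <= clean.length; i++) out.push(clean.slice(i, i + 3));
  return out;
}

function train(samples) {
  // samples: { languageCode: 'training text' }
  const model = {};
  const vocab = new Set();
  for (const [lang, text] of Object.entries(samples)) {
    const counts = new Map();
    let total = 0;
    for (const t of trigrams(text)) {
      counts.set(t, (counts.get(t) || 0) + 1);
      vocab.add(t);
      total++;
    }
    model[lang] = { counts, total };
  }
  return { model, vocabSize: vocab.size };
}

function classify({ model, vocabSize }, text) {
  const input = trigrams(text);
  let best = null;
  let bestScore = -Infinity;
  for (const [lang, { counts, total }] of Object.entries(model)) {
    // Uniform prior, so only the summed log-likelihood decides.
    let score = 0;
    for (const t of input) {
      score += Math.log(((counts.get(t) || 0) + 1) / (total + vocabSize));
    }
    if (score > bestScore) {
      bestScore = score;
      best = lang;
    }
  }
  return best;
}

const nb = train({
  eng: 'the quick brown fox jumps over the lazy dog and then the dog sleeps',
  nld: 'de snelle bruine vos springt over de luie hond en dan slaapt de hond',
});
console.log(classify(nb, 'the dog'));
```

With real training data (Wikipedia articles, say) the model is just a table of trigram counts per language, so its size depends on how many trigrams you keep.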

breuderink 4271 days ago

One method that I have used in the past was über-simple, yet extremely effective. It exploits ZIP compression, based on the insight/assumption that two concatenated texts compress better when they share their language.

I think I found it in this paper [1]. The implementation was like 13 lines of Python code. I wonder how it would compare.

[1] http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...
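A rough sketch of the idea in Node, assuming one reference text per language: compress the reference alone, compress the reference plus the unknown text, and pick the language where the unknown text costs the fewest extra bytes. The helper names and the tiny reference strings are made up for illustration; real references would need to be much longer.

```javascript
const zlib = require('zlib');

// Size in bytes of the DEFLATE-compressed text.
function compressedSize(text) {
  return zlib.deflateSync(Buffer.from(text, 'utf8'), { level: 9 }).length;
}

// How many extra compressed bytes the unknown text adds to a reference.
function extraBytes(reference, text) {
  return compressedSize(reference + ' ' + text) - compressedSize(reference);
}

// Pick the language whose reference text "absorbs" the input most cheaply.
// Ties go to whichever language is listed first.
function detectByCompression(references, text) {
  let best = null;
  let bestCost = Infinity;
  for (const [lang, reference] of Object.entries(references)) {
    const cost = extraBytes(reference, text);
    if (cost < bestCost) {
      bestCost = cost;
      best = lang;
    }
  }
  return best;
}

// Toy references (assumption: far too short for reliable results).
const references = {
  eng: 'the quick brown fox jumps over the lazy dog and then the dog sleeps',
  nld: 'de snelle bruine vos springt over de luie hond en dan slaapt de hond',
};
console.log(detectByCompression(references, 'the dog sleeps'));
```

Because the cost is a whole number of bytes, short inputs can easily produce ties between languages.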

wooorm 4271 days ago

It’s a very interesting idea. Would it work accurately enough when scaled to 160+ languages?

breuderink 4271 days ago

I don't know, I think I used about 40 languages. The beauty is that zip-compression captures rich statistical properties of the languages, so representation-wise it should come a long way. But counting compressed output length discretises the lang-lang distance. For shorter text this might be troubling, since this could easily result in ties. So, maybe. Perhaps I should try :).

wooorm 4271 days ago

Perhaps you should ;) If so, I’d be interested to know how it goes!

wooorm 4273 days ago

One of franc’s focusses was to be pretty small, and usable on the client-side, that’s why no actual training is done and this simple method is used.

Also, I’m interested in a test suite before we start talking about accuracy percentages :p

jules 4273 days ago

Naive bayes is just a couple of lines of code (less than what you have now for sure). The trained model is not bigger than what you currently have either.

Although I don't need a test suite to confidently say that L1 distance is going to be worse than naive bayes, it would indeed be good if you had a test suite!

We already have a test suite with 1 example: https://news.ycombinator.com/item?id=8405180

Naive bayes would never get something like that wrong.

wooorm 4273 days ago

Franc seems to work well on longer passages. Such as these: https://github.com/wooorm/franc/blob/master/spec/fixtures.js...

It’s interesting though, I’ll take a look at it!

jules 4273 days ago

Firstly, that looks like text from the UDHR. Any method will do well on the text that it was trained on, so that will not be representative of real world performance. Secondly, any method will do well on longer passages. If you want to do a more real world test you should pick sentences from an independent source (e.g. wikipedia).

wooorm 4273 days ago

I’ll investigate this, but I think I excluded the preambles for trigram creation. Sure, the words will be a bit similar, but it’ll be a lot of work to compile 380 fixtures from other sources.

I’ll investigate that too. But it’s lots of work, this already was, give me some time :)

allan_s 4273 days ago

The project does not seem to state clearly how the detection is done: does it call an external webservice, or does it rely on an offline database created at some point?

Shameless plug: https://github.com/allan-simon/Tatodetect covers 179 languages (as many as the Tatoeba project does), and it can run offline, with an explanation of how to generate your own database from a CC-BY corpus.

That said, the advantage of Franc is that it can be used directly as an npm library while Tatodetect is a micro-webservice, and for some edge languages Tatodetect is certainly not as good as Franc (I haven't yet tested both to compare).

wooorm 4273 days ago

You are completely right, franc doesn’t state how languages are detected. The detection is based on (1) Unicode script usage and (2) trigram counts. Some scripts are only used by one language. Other scripts, such as Cyrillic, are used by many: those languages are detected using the top 300 trigrams from their corresponding UDHR (Universal Declaration of Human Rights, the most translated document).
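To make step (2) a bit more concrete, here is a hedged sketch of one common way to compare top-trigram lists: rank the trigrams by frequency and sum how far each input trigram sits from its rank in the language profile. This is an assumption about the general technique, not a copy of franc’s source; `topTrigrams` and `rankDistance` are illustrative names.

```javascript
// Build a frequency-ranked list of the most common trigrams in a text.
function topTrigrams(text, limit = 300) {
  const clean = ' ' + text.toLowerCase().replace(/[^\p{L}]+/gu, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i + 3 <= clean.length; i++) {
    const t = clean.slice(i, i + 3);
    counts.set(t, (counts.get(t) || 0) + 1);
  }
  return [...counts.entries()]
    // Most frequent first; ties broken alphabetically for a stable order.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, limit)
    .map(([t]) => t);
}

// Sum of rank differences; a trigram missing from the profile costs `limit`.
// Lower means the input looks more like the profile's language.
function rankDistance(profile, input, limit = 300) {
  const rank = new Map(profile.map((t, i) => [t, i]));
  let distance = 0;
  input.forEach((t, i) => {
    distance += rank.has(t) ? Math.abs(rank.get(t) - i) : limit;
  });
  return distance;
}

const profile = topTrigrams('the quick brown fox jumps over the lazy dog');
console.log(rankDistance(profile, topTrigrams('the lazy dog')));
```

The language whose profile gives the smallest distance would be the top candidate.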

Shameless plug, the page clearly stated you can fork franc to support 300+ languages ;)

allan_s 4273 days ago

Thanks for the information. Don't get me wrong, I'm not trying to play at who has the biggest. It's just that at first, without knowing how the data file was generated, I thought the "you can fork to support 300+ languages" was a statement like "well, provided you find a way to provide the data file for that many languages, which we didn't because doing it is hard/requires a huge corpus". But if it's just parsing more UDHR translations, then sure, it can easily be forked to reach 300+ languages :)

wooorm 4273 days ago

Yeah, so I’d like to add an easier way to support more, or fewer, languages through the Node API. Currently, there’s a number (1e6), the minimum number of speakers a language must have, which is hard-coded in the generation file (I added a link this morning in the statement about forking to the actual line).

If you set that number to 0, or 100,000 and execute `npm prepublish`, your franc supports more languages :) That’s it!

hywel 4273 days ago

Based on a 2-sec look at the code, it's using a built-in database of trigrams as a predictor of the language.

https://github.com/wooorm/franc/blob/master/lib/data.json

allan_s 4273 days ago

My bad, I looked in the data folder first and didn't find anything. I should have tried harder.

riffraff 4273 days ago

the question would be where he got the language data

If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.

wooorm 4273 days ago

No full-frequency data is kept; only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!

riffraff 4273 days ago

yes, I meant: keeping full frequency could have been avoided to save space/memory but having two classes high/low could be a good tradeoff.
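Roughly what I have in mind, as an untested sketch: split each language's ranked trigram list into a "high" and a "low" set and give a hit in the high set more weight. The split point and the weights (2 and 1) are arbitrary assumptions, and the function names are made up.

```javascript
// Split a frequency-ranked trigram list into two classes.
function buildClasses(rankedTrigrams, highCount) {
  return {
    high: new Set(rankedTrigrams.slice(0, highCount)),
    low: new Set(rankedTrigrams.slice(highCount)),
  };
}

// Score an input: high-frequency hits count more than low-frequency hits,
// and trigrams unknown to the language contribute nothing.
function classScore(classes, inputTrigrams, weights = { high: 2, low: 1 }) {
  let total = 0;
  for (const t of inputTrigrams) {
    if (classes.high.has(t)) total += weights.high;
    else if (classes.low.has(t)) total += weights.low;
  }
  return total;
}

const english = buildClasses([' th', 'the', 'he ', 'ing'], 2);
console.log(classScore(english, [' th', 'ing', 'xyz']));
```

This only needs one extra bit per trigram compared to storing a plain list.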

wooorm 4273 days ago

It’s an interesting thought. I might fiddle on it, but I’m not sure it would work in practice (d’oh). Thanks!

perlgeek 4273 days ago

What I'd really like to see is code that takes a body of text and extracts parts that are written in another language.

That's quite common, like in mixed-language IRC channels, quotes from English documents in documents mostly written in another language, and so on.

And stemming and indexing such a document for full text search usually gives crappy results.

(Bonus points for detecting programming code samples, so that this part isn't stemmed at all.)

wooorm 4273 days ago

That would be awesome :)

grimborg 4273 days ago

Interesting!

Sometimes it gets it almost right: I tried with this piece of text in Catalan (Balear variant) and it classifies it as Portuguese (with Catalan as 2nd option): "I s'horabaixa la deixam passar i me mires tan a prop que me fa mal, que surt es sol i encara plou, que t'estim massa i massa poc, que no sé com ho hem d'arreglar, que som amics, que som amants."

It's strange, because it's pretty different from Portuguese...

The Catalan poem "tirallonga de monosíl·labs" gets classified as French. (http://www.rodamots.com/calaix.asp?text=tirallonga)

wooorm 4273 days ago

It sucks, right? Currently, it’s good at long passages. But for shorter values, the results are pretty poor. The amount of supported languages is just too damn high!

lifthrasiir 4273 days ago

The 60% threshold for the single-language scripts seems to be way low for CJK languages. And your method to calculate the occurrence ratio is flawed.

CJK scripts and languages tend to be relatively more concise (in terms of # of Unicode codepoints) than many other languages, so it is possible that the ratio of CJK scripts over non-CJK scripts can be lower than the average. And the occurrence ratio is currently calculated over the number of characters including non-letters, making the ratio much lower. Maybe a custom threshold per script based on an actual corpus (90th percentile, maybe?) and a better occurrence calculation would improve detection for those languages.
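To illustrate the difference, a small sketch comparing the two ways of computing the ratio, using Unicode property escapes. The function names are made up and this is not franc's code; it only shows how digits and punctuation drag the ratio down when they are counted in the denominator.

```javascript
const HANGUL = /\p{Script=Hangul}/u;

// Ratio over all code points, including digits, spaces and punctuation.
function naiveRatio(text, script) {
  const chars = [...text];
  if (chars.length === 0) return 0;
  return chars.filter((ch) => script.test(ch)).length / chars.length;
}

// Ratio over letters only, so digits and punctuation no longer count.
function letterRatio(text, script) {
  const letters = text.match(/\p{L}/gu) || [];
  if (letters.length === 0) return 0;
  return letters.filter((ch) => script.test(ch)).length / letters.length;
}

const sample = '한국12';
console.log(naiveRatio(sample, HANGUL), letterRatio(sample, HANGUL));
```

In this toy input half the code points are digits, so the naive ratio drops below a 60% threshold while the letter-only ratio does not.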

wooorm 4273 days ago

I’m not sure. I don’t know any CJK languages myself. I’d like some test-cases where the current methods do not work, as the example in the Readme seems to work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?

lifthrasiir 4273 days ago

Some examples follow. I've really tested with arbitrary text on the Web, and I agree that they are somewhat marginal examples. (But I do think that Franc's margin for CJK languages is way wide.)

한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%), 스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전 세계 69억여 명의 인구 중 약 1%에 해당한다.

This text from Korean Wikipedia is about the share of Korean documents among all documents on the Internet. The digits distort the overall ratio, and Franc doesn't give any candidates (not even "und").

現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book" はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library" は前置詞とともに付け加えられた修飾語と考えられる。

This text from Japanese Wikipedia is about the distinction between objects and complements in English syntax. In this bilingual text, Japanese looks like it reaches the 60% threshold, but by codepoint count it doesn't.

wooorm 4272 days ago

I pushed a fix, incorporating your suggestions, and your examples in the specs.

Thanks a lot!

wooorm 4273 days ago

Oh you’re right. I think I have a fix in mind, will work on it. Thanks so much!

jodent 4273 days ago

Quick test:

  ron? snn
  fra? cat
  swe? nds
  ita? und
  nld? gax

Source:

  var franc = require('franc');
  console.log('ron?', franc('Cate bere ai baut?'));
  console.log('fra?', franc('C\'est quoi le bordel la, putain'));
  console.log('swe?', franc('Jag kanner en bot, hon heter Anna'));
  console.log('ita?', franc('che guai'));
  console.log('nld?', franc('graag gedaan'));

indubitably 4273 days ago

Testing a statistical language identifier with texts this short is absurd. If you type in four or five words from

https://en.wikipedia.org/wiki/List_of_English_words_of_Frenc...

…do you expect it to return French or English?

andreasvc 4273 days ago

It is not absurd. Generally, if humans can do it, it is a reasonable task for NLP to attempt.

Yes you can present edge cases where there is no definite answer, like the one you cite, but this doesn't mean that the task in general is impossible or useless.

wooorm 4273 days ago

I agree the task is neither impossible nor useless. There’s work to do. Short passages should be supported. I do however think franc does a good job, and adds support for some languages which before today have never (I think) been supported. Franc, certainly, “attempt”s to fix language detection, which I would argue is an AI-complete problem.

wooorm 4273 days ago

Ha! Some very nice examples, I have to say :)

Anyway, you’re completely right. Italian is `und` because the input is 10 characters or fewer; the others are slightly off due to short input too, but the demo (http://wooorm.github.io/franc/) does show the correct languages in second or third place!

jodent 4273 days ago

No it doesn't, still takes French for Catalan (French only comes at third place, after Italian), and Swedish for Dutch. (Arguably those are close languages, but hey, this is why I'm using this, right?)

wooorm 4273 days ago

By `correct language` I mean the language you expect; by `second` and `third` I mean `2.` and `3.` in the previously mentioned demo (http://wooorm.github.io/franc/). I think we’re talking about the same thing!

Anyway, yeah, franc is for language detection, but it’s optimised for many languages and works best on longer text. It’s a trade-off. For fewer languages and shorter texts, check out https://github.com/shuyo/ldig

pierrec 4273 days ago

I don't know about the other languages, but your French test is incorrect. (la => là)

Though I'm sure your test wasn't intended to be insidiously misleading.

robin_reala 4273 days ago

The supported languages file (https://github.com/wooorm/franc/blob/master/Supported-Langua...) lists Matu Chin as having 182,000,000 speakers. Having never heard of it this surprised me, but the Wikipedia page for it (http://en.wikipedia.org/wiki/Matu_Chin_language) lists 40,000 speakers. Mistake to fix?

wooorm 4273 days ago

You seem to be completely right. I hand-crawled the data (https://github.com/wooorm/speakers), but seem to have made a big typo there! Thanks!

mholmes680 4273 days ago

+1 for using those ISO codes. I introduced them at work 4 years ago, and everyone looked at me like I had ten heads.

allan_s 4273 days ago

+1 indeed, but I think most people already have a hard time seeing why we need to distinguish between country codes and language codes, and even more so that something people consider a "dialect" can actually be a totally different language (for example, in China a lot of "dialects" (fangyan) are not actually dialects of Mandarin, such as Shanghainese (a Wu language) and the languages of Hunan province).

After that, you can also try to explain to them that the common "represent a language by a flag" approach quickly breaks down and leads to heated arguments (what flag do you use for Tibetan, for example? Or for each of the Indian languages?)

michaelmior 4273 days ago

It would be interesting to see comparisons with language detection libraries written in other languages as well. Not just in terms of runtime, but also accuracy. Actually, it seems like this would be useful as a separate project to help the decision-making process when choosing a library.

wooorm 4273 days ago

Agreed :)

allan_s 4273 days ago

for the case of "one sentence detection" you can use Tatoeba project database dump http://tatoeba.org/eng/downloads

You get a CSV of ISO code => sentence, which should be 99% accurate (as it gets proofread by users), so you can compare your tool against it.

I think for longer texts one could use a Wikipedia dump or the like?
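A comparison harness for that could be as small as the sketch below. It assumes each line has the form "code, tab, sentence" (check the actual dump layout before relying on this) and that the detector is a function from text to an ISO code; both the `accuracy` helper and the stub detector are made up for illustration.

```javascript
// Fraction of labelled sentences for which detect(sentence) returns the label.
// Lines without a tab are skipped rather than counted as errors.
function accuracy(lines, detect) {
  let correct = 0;
  let total = 0;
  for (const line of lines) {
    const tab = line.indexOf('\t');
    if (tab === -1) continue;
    const expected = line.slice(0, tab);
    const sentence = line.slice(tab + 1);
    total++;
    if (detect(sentence) === expected) correct++;
  }
  return total === 0 ? 0 : correct / total;
}

// Stub detector that always answers English, just to exercise the harness.
const alwaysEnglish = () => 'eng';
const labelled = ['eng\tHow are you?', 'nld\tGraag gedaan.', 'malformed line'];
console.log(accuracy(labelled, alwaysEnglish));
```

Swapping the stub for `franc` or any other library gives comparable numbers on the same sentences.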

michaelmior 4273 days ago

Thanks for the pointer. I might decide to whip something up one of these days. I really have no need for language detection, but I just find it interesting and I'm curious to see which libraries will win out.

1ris 4273 days ago

"»Butter and cheese« is proper English and proper Fries."

Unfortunately Fries is not supported, but I'd be interested in the results. But I don't think polyglots for natural languages are common, this is in fact the only one I know.

wooorm 4273 days ago

And it doesn’t have a Universal Declaration of Human Rights: http://www.unicode.org/udhr/index_by_name.html

Luc 4273 days ago

It does have several translations of the bible, though. I guess it would be a lot of work to find bible translations for all those languages - or was there another reason for using the Human Rights Declaration?

P.S. Kudos, very cool project!

EDIT: Frisian version should you want it: https://www.google.com/search?q=Yn+betinken+nommen+dat+it+er...

wooorm 4273 days ago

Thanks! Currently, the UDHRs are crawled, and I’d rather not include exceptions and maintain their plain-text and XML/JSON versions by hand. If you’re into growing the language, I suggest contacting the Office of the High Commissioner of Human Rights of the UN, and the Unicode project, or fork wooorm/udhr and add support, I’ll merge :)

wooorm 4273 days ago

Fries as in Frisian? I don’t think it has one million speakers (right?) :p

BenjaminN 4273 days ago

Tried "hey how are you?", gives me Haitian first.

wooorm 4273 days ago

That’s because Haitians always say that! No, joking, it’s just that because of so many supported languages, the accuracy for very short inputs is extremely low.

allan_s 4273 days ago

For that, what I've done in Tatodetect (which is meant specifically for detecting the language of one sentence at a time) is to keep a database of N-grams large enough that each language is nearly sure to have "them all". That way, if the text to detect contains an N-gram that a language does not have in the database, you can apply a score penalty to that language.
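As a simplified sketch of the penalty idea (not Tatodetect's actual code; the function name, the +1 reward and the penalty of 5 are arbitrary assumptions):

```javascript
// Score an input against one language's set of known N-grams.
// Known N-grams add 1; an N-gram the language has never seen subtracts `penalty`.
function penalizedScore(knownNgrams, inputNgrams, penalty = 5) {
  let score = 0;
  for (const gram of inputNgrams) {
    score += knownNgrams.has(gram) ? 1 : -penalty;
  }
  return score;
}

const known = new Set(['he', 'ey']);
console.log(penalizedScore(known, ['he', 'ey', 'yx']));
```

The penalty only makes sense if the database really is close to exhaustive; with a small database it would punish the right language too.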

ppod 4273 days ago

A regularized prior would help.

wooorm 4273 days ago

I’m also really interested in trying something like this: http://www.slideshare.net/shuyo/short-text-language-detectio... (slide 6). But I’d need a lot of training data, more than UDHR.

apierre 4273 days ago

I am using IDOL OnDemand which gives good results too.

melling 4273 days ago

On a slight tangent, are there open source dictionaries that developers can use for app localization, etc?