Hacker News
by jodent 4273 days ago
Quick test:

  ron? snn
  fra? cat
  swe? nds
  ita? und
  nld? gax
Source:

  var franc = require('franc');
  console.log('ron?', franc('Cate bere ai baut?'));
  console.log('fra?', franc('C\'est quoi le bordel la, putain'));
  console.log('swe?', franc('Jag kanner en bot, hon heter Anna'));
  console.log('ita?', franc('che guai'));
  console.log('nld?', franc('graag gedaan'));
3 comments

Testing a statistical language identifier with texts this short is absurd. If you type in four or five words from

https://en.wikipedia.org/wiki/List_of_English_words_of_Frenc...

…do you expect it to return French or English?

It is not absurd. Generally, if humans can do it, it is a reasonable task for NLP to attempt.

Yes you can present edge cases where there is no definite answer, like the one you cite, but this doesn't mean that the task in general is impossible or useless.

I agree the task is neither impossible nor useless. There’s work to do. Short passages should be supported. I do however think franc does a good job, and adds support for some languages which (I think) had never been supported before today. Franc certainly “attempts” to fix language detection, which I would argue is an AI-complete problem.
Ha! Some very nice examples, I have to say :)

Anyway, you’re completely right. Italian is `und` because the input is 10 characters or fewer; the others are slightly off due to short input too, but the demo (http://wooorm.github.io/franc/) shows the correct languages in second or third place!
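
To make the short-input behaviour concrete, here is a rough sketch of the kind of length guard described above. It is not franc's actual source: `MIN_LENGTH`, `detect` and `fakeRank` are hypothetical names, the threshold is simply the "10 characters" figure from this comment, and the scores in `fakeRank` are placeholders that only mirror the ordering reported in this thread.

```javascript
// Hypothetical sketch, NOT franc's real implementation.
// Assumed threshold, taken from the "10 characters" remark in the thread.
var MIN_LENGTH = 10;

// `rank` is any function that returns [[code, score], ...] sorted best-first.
function detect(text, rank) {
  // Too little text to judge: report "undetermined" instead of guessing.
  if (text.length <= MIN_LENGTH) {
    return 'und';
  }
  var candidates = rank(text);
  return candidates.length > 0 ? candidates[0][0] : 'und';
}

// Stand-in ranker for illustration only; it ignores its input and the
// scores are made up. A real detector would score the text itself.
function fakeRank() {
  return [['cat', 1], ['ita', 0.9], ['fra', 0.8]];
}

console.log('ita?', detect('che guai', fakeRank)); // prints "ita? und"
console.log('fra?', detect("C'est quoi le bordel la, putain", fakeRank)); // prints "fra? cat"
```

The point is only that `che guai` (8 characters) never reaches the ranking step, while a longer sentence does and takes whatever comes first in the ranked list, with the expected language possibly sitting in second or third place.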

No it doesn't, still takes French for Catalan (French only comes at third place, after Italian), and Swedish for Dutch. (Arguably those are close languages, but hey, this is why I'm using this, right?)
By `correct language` I mean the language you expect, and by `second` and `third` I mean `2.` and `3.` in the previously mentioned demo (http://wooorm.github.io/franc/). I think we’re talking about the same thing!

Anyway, yeah, franc is for language detection, but it’s optimised for many languages and works best on longer text. It’s a trade-off. For fewer languages and shorter texts, check out https://github.com/shuyo/ldig

I don't know about the other languages, but your French test is incorrect. (la => là)

Though I'm sure your test wasn't intended to be insidiously misleading.