by allan_s 4274 days ago
The project does not seem to state clearly how the detection is done: does it call an external web service, or does it rely on an offline database created at some point?

Shameless plug: https://github.com/allan-simon/Tatodetect covers 179 languages (as many as the Tatoeba project does), and it can run offline, with an explanation of how to generate your own database from a CC-BY corpus.

That said, the advantage of Franc is that it can be used directly as an npm library, while Tatodetect is a micro web service, and for some edge languages Tatodetect is certainly not as good as Franc (I haven't yet run a test to compare the two).


You are completely right, franc doesn’t state how languages are detected. The detection is based on (1) Unicode script usage and (2) trigram counts. Some scripts are only used by one language. Other scripts, such as Cyrillic, are shared by many more: those languages are detected using the top 300 trigrams from their corresponding UDHR translation (the Universal Declaration of Human Rights, the most translated document).
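To make the two steps above concrete, here is a minimal Python sketch. It is not franc's actual code: the script ranges, the `SINGLE_LANGUAGE_SCRIPTS` table, the rank-distance scoring, and all function names are assumptions chosen for illustration, and the profiles are built from toy sentences rather than UDHR texts.

```python
import re
from collections import Counter

# Hypothetical, heavily truncated script table (assumption, not franc's data).
SCRIPT_RANGES = {
    "Latin": ("\u0041", "\u024f"),
    "Cyrillic": ("\u0400", "\u04ff"),
    "Hangul": ("\uac00", "\ud7af"),
}
# Scripts assumed here to map to exactly one language.
SINGLE_LANGUAGE_SCRIPTS = {"Hangul": "kor"}

def dominant_script(text):
    """Return the script with the most letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ch <= hi:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None

def top_trigrams(text, n=300):
    """Return the n most frequent character trigrams, most frequent first."""
    cleaned = " " + re.sub(r"\s+", " ", text.lower()).strip() + " "
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    return [t for t, _ in counts.most_common(n)]

def distance(sample, profile, max_penalty=300):
    """Sum of rank differences; trigrams missing from the profile cost max_penalty."""
    rank = {t: i for i, t in enumerate(profile)}
    return sum(abs(rank[t] - i) if t in rank else max_penalty
               for i, t in enumerate(sample))

def detect(text, profiles_by_script):
    """Step 1: pick the script. Step 2: compare trigram ranks within that script."""
    script = dominant_script(text)
    if script in SINGLE_LANGUAGE_SCRIPTS:
        return SINGLE_LANGUAGE_SCRIPTS[script]
    candidates = profiles_by_script.get(script, {})
    if not candidates:
        return None
    sample = top_trigrams(text)
    return min(candidates, key=lambda lang: distance(sample, candidates[lang]))

# Toy profiles; a real setup would build them from a full corpus per language.
PROFILES = {
    "Latin": {
        "eng": top_trigrams("the quick brown fox jumps over the lazy dog and the cat"),
        "spa": top_trigrams("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
    }
}
```

With these toy profiles, `detect("the dog and the fox", PROFILES)` picks the English profile, and a Hangul string is resolved by the script step alone without touching any trigrams.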

Shameless plug, the page clearly stated you can fork franc to support 300+ languages ;)

Thanks for the information. Don't get me wrong, I'm not trying to play at who has the biggest here. It's just that at first, without knowing how the data file was generated, I thought the "you can fork to support 300+ languages" line was a statement like "well, provided you find a way to produce the data file for that many languages, which we didn't because doing it is hard and requires a huge corpus". But if it's just a matter of parsing more UDHR translations, then sure, it can easily be forked to reach 300+ languages :)
Yeah, so I’d like to add an easier way to support more, or fewer, languages through the Node API. Currently, there’s a number (1e6), the minimum number of speakers a language must have to be included, which is hard-coded in the generation file (I added a link this morning in the statement about forking to the actual line).

If you set that number to 0, or 100,000 and execute `npm prepublish`, your franc supports more languages :) That’s it!
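The effect of that number can be sketched as a simple filter. This is only an illustration of the idea in Python, not franc's generation script: the `SPEAKERS` table, its language codes, and its counts are made-up placeholders, and `supported_languages` is a hypothetical name.

```python
# Placeholder speaker counts keyed by made-up language codes (not real data).
SPEAKERS = {
    "aaa": 5_000_000,
    "bbb": 250_000,
    "ccc": 40_000,
}

def supported_languages(speakers, threshold=1_000_000):
    """Keep only languages with at least `threshold` speakers."""
    return sorted(code for code, count in speakers.items() if count >= threshold)

default_set = supported_languages(SPEAKERS)            # threshold of 1e6
larger_set = supported_languages(SPEAKERS, 100_000)    # lowered threshold
full_set = supported_languages(SPEAKERS, 0)            # no cutoff at all
```

Lowering the threshold only ever adds languages, which is why setting it to 0 yields every language for which data exists.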

Based on a 2-sec look at the code, it's using a built-in database of trigrams as a predictor of the language.

https://github.com/wooorm/franc/blob/master/lib/data.json

My bad, I looked in the data folder first and didn't find anything. I should have tried harder.
The question would be where he got the language data.

If the original language data is available I'd suggest classifying the trigrams as "high" and "low" frequency, which should improve performance without needing to keep full frequency data.
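One possible reading of the high/low suggestion, sketched in Python. Everything here is an assumption made for illustration: the 50% split, the weights of 2 and 1, and the function names are not taken from franc or from the comment above.

```python
from collections import Counter

WEIGHTS = {"high": 2, "low": 1}  # assumed weights, chosen arbitrarily

def trigram_counts(text):
    """Count character trigrams in text (no padding, for brevity)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def classify(counts, high_share=0.5):
    """Label the most frequent share of trigrams 'high' and the rest 'low'."""
    ranked = [t for t, _ in counts.most_common()]
    cutoff = int(len(ranked) * high_share)
    return {t: ("high" if i < cutoff else "low") for i, t in enumerate(ranked)}

def score(text, classes):
    """Sum the class weight of every distinct trigram of text found in classes."""
    return sum(WEIGHTS[classes[t]] for t in trigram_counts(text) if t in classes)

# Toy frequency table; only the resulting class labels would need to be stored.
counts = Counter({"the": 5, "he ": 4, "qui": 1, "uic": 1})
classes = classify(counts)
```

Each trigram then needs a single bit instead of a full count, and the language whose class table gives the highest `score` would be the guess.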

No full-frequency data is kept; only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!
Yes, I meant: keeping full frequency data could be avoided to save space/memory, but having two classes, high and low, could be a good tradeoff.
It’s an interesting thought. I might fiddle with it, but I’m not sure it would work in practice (d’oh). Thanks!