Hacker News
Fast and accurate language identification using fastText (fasttext.cc)
46 points by exgrv 3184 days ago
5 comments

Nice! The author built this on tatoeba.org data. I used to be the main developer there, and I wrote a language detector for Tatoeba (because it was painful for people to have to enter a sentence AND its language, especially for polyglots). So it's quite likely that the language labels used to train this detector were themselves produced by a language detector, which is funny when you think about it :)

https://github.com/allan-simon/Tatodetect (I should rewrite it in Rust someday). It's a simple N-gram detector.
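
For anyone wondering what a "simple N-gram detector" looks like in general, here is a minimal Python sketch of the idea. To be clear, this is not Tatodetect's actual code or algorithm: the function names (`char_ngrams`, `train`, `detect`), the toy `SAMPLES` sentences, and the choice of character trigrams with add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language.

    `samples` maps a language code to a list of example sentences.
    """
    return {
        lang: Counter(g for sentence in sentences for g in char_ngrams(sentence, n))
        for lang, sentences in samples.items()
    }


def detect(text, profiles, n=3):
    """Return the language whose profile gives text the highest log-likelihood."""
    # Number of distinct n-grams seen across all languages (for smoothing).
    vocab_size = len(set().union(*profiles.values()))
    grams = char_ngrams(text, n)

    def log_likelihood(lang):
        counts = profiles[lang]
        # Add-one smoothing so unseen n-grams get a small, non-zero probability.
        denom = sum(counts.values()) + vocab_size + 1
        return sum(math.log((counts[g] + 1) / denom) for g in grams)

    return max(profiles, key=log_likelihood)


# Toy training data, purely illustrative; a real detector needs far more text.
SAMPLES = {
    "eng": [
        "the quick brown fox jumps over the lazy dog",
        "this is a simple sentence in english",
        "where is the train station",
    ],
    "fra": [
        "le renard brun saute par-dessus le chien paresseux",
        "ceci est une phrase simple en français",
        "où est la gare",
    ],
    "deu": [
        "der schnelle braune fuchs springt über den faulen hund",
        "dies ist ein einfacher satz auf deutsch",
        "wo ist der bahnhof",
    ],
}

PROFILES = train(SAMPLES)
print(detect("the dog is in the station", PROFILES))
```

With only a handful of sentences per language this is fragile on short or mixed-language input, which is roughly where the larger models are supposed to do better.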

Really fascinating from a linguistics perspective. I'm curious how this works and whether it could be generalized to help with cataloguing dying languages.
I think it would be cool to see how easily they could create a WASM/asm.js target.
Why is it just 93% accurate on Wikipedia? Is it that hard to identify languages?
I suspect it's due to mixed-language content on Wikipedia. A lot of Wikipedia articles discuss foreign-language art and culture; this is one of the largest (if not the largest) single categories of content on non-English Wikipedias.
Yes, it's not so good on samples that contain several languages.
I'll try to run a comparison with Google's CLD tomorrow.
Bearing in mind that fastText supports many more languages than CLD.
It depends on the mode, but I've compared on only ~60 languages.
Please let us know how it goes.
I just posted a link to the blog post.