Fast and accurate language identification using fastText

	Fast and accurate language identification using fastText (fasttext.cc)
	46 points by exgrv 3184 days ago

5 comments

allan_s 3184 days ago

Nice! The author built this on tatoeba.org data. I used to be the main developer there, and I created a language detector for Tatoeba because it was painful for people to have to enter both a sentence AND its language, especially for polyglots. So it's quite likely that the language labels used to train this language detector were themselves produced by a language detector. Funny when you think about it :)

https://github.com/allan-simon/Tatodetect (I should rewrite it in Rust someday). It's a simple n-gram detector.
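
For readers unfamiliar with the idea, here is a minimal sketch of what a character n-gram detector can look like. It is not Tatodetect's actual code: the function names, the trigram size, the dot-product scoring, and the tiny training sentences are all assumptions made up for illustration, and a real detector would train on far more data.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams, padding with a space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples, n=3):
    """Build one relative-frequency n-gram profile per language.

    samples maps a language code to a list of example sentences.
    """
    profiles = {}
    for lang, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(char_ngrams(sentence, n))
        total = sum(counts.values())
        profiles[lang] = {gram: c / total for gram, c in counts.items()}
    return profiles


def detect(text, profiles, n=3):
    """Return the language whose profile overlaps most with the text."""
    grams = char_ngrams(text, n)

    def score(profile):
        # Dot product between the text's n-gram counts and the profile.
        return sum(c * profile.get(gram, 0.0) for gram, c in grams.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Toy training data, invented for this example only.
SAMPLES = {
    "eng": ["the cat is on the table", "where is the train station",
            "this is a short sentence"],
    "fra": ["le chat est sur la table", "où est la gare",
            "ceci est une phrase courte"],
    "deu": ["die katze ist auf dem tisch", "wo ist der bahnhof",
            "das ist ein kurzer satz"],
}
PROFILES = train(SAMPLES)

print(detect("the station is short", PROFILES))
print(detect("la gare est courte", PROFILES))
print(detect("wo ist die katze", PROFILES))
```

With profiles this small the detector only works on text that shares trigrams with the training sentences; the point is the shape of the approach, not its accuracy.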

matthberg 3184 days ago

Really fascinating from a linguistics perspective. I'm curious how this works and whether it could be generalized to help with cataloguing dying languages.

wyldfire 3184 days ago

I think it would be cool to see how easily they could create a WASM/asm.js target.

visarga 3184 days ago

Why is it just 93% accurate on Wikipedia? Is it that hard to identify languages?

microcolonel 3184 days ago

I suspect it's due to mixed-language content on Wikipedia. A lot of Wikipedia articles talk about foreign-language art and culture; this is one of the largest (if not the largest) single categories of content on non-English Wikipedias.

alexott 3180 days ago

Yes, it's not so good on samples that contain several languages.

alexott 3184 days ago

I'll try to run a comparison with Google's CLD tomorrow.

alexott 3180 days ago

http://alexott.blogspot.de/2017/10/evaluating-fasttexts-mode...

microcolonel 3184 days ago

Bearing in mind that fastText supports many more languages than CLD.