Hacker News
by donutdan4114 4592 days ago
"test it out" comes back as French...
1 comment

Maybe you've fallen into the 1% error rate?
Language guessing is rather hard when only a few letters are available, especially with statistical methods. I think accuracy passes 95% after 20-something letters. In a simple library I wrote (https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id, works for 60 languages but no docs yet), the test results for Turkish and English are:

For 20 letters:

TR=95.90% EN=94.96%

For 50 letters:

TR=99.44% EN=99.53%

With 50 letters per document, it identifies about 20,000 documents per second on a decent desktop.
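
For anyone wondering why short inputs are the hard case, here is a rough sketch of one common statistical approach: a character trigram model per language, scored with add-one smoothing. This is not how zemberek-nlp does it; the names (ngrams, NgramModel, identify) and the tiny training sentences are made up for illustration, and a real model would be trained on far more text. Each extra trigram adds a bit more evidence to the score, so a three-word input gives the models very little to disagree on.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Return the character n-grams of lowercased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Toy character n-gram model for one language (illustrative only)."""

    def __init__(self, samples, n=3):
        self.n = n
        self.counts = Counter()
        for sample in samples:
            self.counts.update(ngrams(sample, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts)

    def log_prob(self, text):
        # Add-one smoothing; the extra +1 reserves mass for unseen n-grams.
        denom = self.total + self.vocab + 1
        return sum(
            math.log((self.counts[g] + 1) / denom)
            for g in ngrams(text, self.n)
        )


def identify(text, models):
    """Return the language code whose model gives text the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Hypothetical, tiny training sets; far too small for real use.
models = {
    "en": NgramModel([
        "the quick brown fox jumps over the lazy dog",
        "this is a simple test of the language guesser",
        "what is the weather like today",
    ]),
    "tr": NgramModel([
        "bugün hava çok güzel ve güneşli",
        "bu basit bir dil tanıma denemesidir",
        "hızlı kahverengi tilki tembel köpeğin üzerinden atlar",
    ]),
}

if __name__ == "__main__":
    print(identify("the weather is simple today", models))
    print(identify("bugün hava çok basit", models))
```

With training sets this small the scores on something like "test it out" are close to a coin flip, which is roughly the situation any statistical guesser is in when the input is only a handful of letters.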