by afsina 4592 days ago
Language guessing is rather hard when only a few letters are available, especially if you use statistical methods. I think you get to around 95% accuracy after 20-something letters. In a simple library I wrote (https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id , works for 60 languages but no docs yet), the test results for Turkish and English are:

For 20 letters

TR=95.90 EN=94.96

For 50 letters

TR=99.44 EN=99.53

With 50 letters per document, it identifies about 20,000 documents per second on a decent desktop.
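
To give an idea of what "statistical methods" can mean here, below is a minimal sketch of one common approach: character trigram counts per language, scored with add-one smoothing. This is not how the library above is implemented (I am not claiming anything about its internals). The function names, the two tiny training strings and the choice of trigrams are all assumptions made up for illustration, and a real model would be trained on far more text.

```python
import math
from collections import Counter

# Tiny made-up training samples (assumption: real models need far more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a "
          "simple english sentence about language",
    "tr": "bugün hava çok güzel ve bu türkçe basit bir cümle dil tanıma "
          "için yazılmış bir örnek",
}

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded with one space per side."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Count n-grams per language; also return the smoothing vocabulary size."""
    counts = {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}
    # Distinct n-grams seen in any language, plus one slot for unseen n-grams.
    vocab_size = len(set().union(*counts.values())) + 1
    return counts, vocab_size

def log_likelihood(text, lang_counts, vocab_size, n=3):
    """Sum of add-one smoothed log probabilities of the text's n-grams."""
    total = sum(lang_counts.values())
    return sum(
        math.log((lang_counts[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )

def identify(text, counts, vocab_size, n=3):
    """Return the language whose n-gram model gives the text the highest score."""
    return max(counts, key=lambda lang: log_likelihood(text, counts[lang], vocab_size, n))

COUNTS, VOCAB_SIZE = train(SAMPLES)
print(identify("this is a simple sentence", COUNTS, VOCAB_SIZE))
print(identify("bu basit bir cümle", COUNTS, VOCAB_SIZE))
```

With so little training data this only behaves sensibly on inputs that share trigrams with the samples. It also shows why short inputs are hard: every extra letter adds one more trigram of evidence, so a handful of letters gives the scores very little to separate on.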