by afsina
4592 days ago
Language guessing is rather hard when only a few letters are available, especially if you use statistical methods. I think after 20-something letters you enter the >95% accuracy zone. In a simple library I wrote (https://github.com/ahmetaa/zemberek-nlp/tree/master/lang-id , works for 60 languages but no docs yet), the test accuracy (%) for Turkish and English is:

For 20 letters: TR=95.90, EN=94.96
For 50 letters: TR=99.44, EN=99.53

With 50 letters per doc, it identifies about 20,000 docs per second on a decent desktop.
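To make the "statistical methods" point concrete, here is a minimal sketch of one common approach: score the input against per-language character trigram counts and pick the best log-likelihood. This is not the zemberek-nlp API or necessarily how that library works. The function names (`char_ngrams`, `train`, `guess`) and the tiny `SAMPLES` texts are made up for illustration, and a real model would be trained on far more data.

```python
import math
from collections import Counter

# Toy training text per language (hypothetical; far too small for real use).
SAMPLES = {
    "tr": "bu bir deneme cümlesidir ve dil tanıma için kullanılır bugün hava çok güzel",
    "en": "this is a test sentence and it is used for language identification "
          "the weather is nice today",
}

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count table per language from {lang: text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def guess(models, text, n=3):
    """Return the language whose add-one-smoothed model scores the text highest."""
    vocab = len(set().union(*models.values()))  # distinct n-grams over all languages
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Sum of log-probabilities; unseen n-grams get a small nonzero probability.
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab + 1))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

models = train(SAMPLES)
print(guess(models, "güzel bir gün"))
print(guess(models, "the weather is nice"))
```

With fewer letters there are fewer trigrams to sum over, so a single misleading trigram can swing the score. That is one way to see why accuracy drops on very short inputs.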