by wooorm 4271 days ago
That’s because Haitians always say that! No, joking, it’s just that, with so many supported languages, the accuracy for very short inputs is extremely low.
2 comments

For that, the way I've done it for TatoDetect (which is meant specifically for detecting the language "one sentence at a time") is to have a database of N-grams huge enough for each language to be nearly sure of having "them all". That way, if the text to detect contains an N-gram that a language does not have in its database, you can apply a 'decrease score' penalty for that language.
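A rough sketch of that penalty idea, in Python. This is not TatoDetect's actual code: the function names, the character-trigram choice, and the `reward`/`penalty` weights are all made up for illustration, and a real database would be far larger than a handful of sentences.

```python
def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profiles(samples, n=3):
    """Map each language to the set of n-grams seen in its training sentences.

    `samples` is a hypothetical dict: language code -> list of sentences.
    """
    profiles = {}
    for lang, sentences in samples.items():
        seen = set()
        for sentence in sentences:
            seen.update(char_ngrams(sentence, n))
        profiles[lang] = seen
    return profiles


def detect(text, profiles, n=3, reward=1.0, penalty=2.0):
    """Score every language; n-grams missing from a profile lower its score.

    The reward and penalty values are arbitrary assumptions, not tuned numbers.
    """
    grams = char_ngrams(text, n)
    scores = {}
    for lang, known in profiles.items():
        score = 0.0
        for gram in grams:
            # Known n-gram: small reward. Unknown n-gram: apply the 'decrease score'.
            score += reward if gram in known else -penalty
        scores[lang] = score
    best = max(scores, key=scores.get)
    return best, scores
```

You would call `build_profiles` once on the per-language sentence lists, then `detect(sentence, profiles)` for each input. The penalty only makes sense if the database really is close to complete for each language; with a small one, perfectly valid sentences get punished for n-grams that simply were never seen.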
A regularized prior would help.
I’m also really interested in trying something like this: http://www.slideshare.net/shuyo/short-text-language-detectio... (slide 6). But I’d need a lot of training data, more than UDHR.