Hacker News
by BenjaminN 4272 days ago
Tried "hey how are you?", gives me Haitian first.

That’s because Haitians always say that! No, joking. It’s just that with so many supported languages, the accuracy for very short inputs is extremely low.
For that, what I've done in TatoDetect (which is meant specifically for detecting the language of one sentence at a time) is to keep a database of N-grams large enough for each language that you can be nearly sure it has "them all". Then, if the text to detect contains an N-gram that a language does not have in the database, you can apply a score penalty to that language.
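To make the penalty idea concrete, here is a minimal sketch in Python. It is not TatoDetect's actual code: the function names, the use of character trigrams, the toy corpus, and the `penalty` weight are all assumptions made for illustration. A real database would need far more text per language for the "nearly all N-grams" assumption to hold.

```python
from collections import defaultdict

def char_ngrams(text, n=3):
    """Return the set of character n-grams of text (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_db(corpus, n=3):
    """Build {language: set of known n-grams} from {language: [sentences]}."""
    db = defaultdict(set)
    for lang, sentences in corpus.items():
        for sentence in sentences:
            db[lang] |= char_ngrams(sentence, n)
    return dict(db)

def score(text, db, n=3, penalty=1.0):
    """Score each language: +1 per known n-gram, -penalty per unknown one."""
    grams = char_ngrams(text, n)
    scores = {}
    for lang, known in db.items():
        missing = sum(1 for g in grams if g not in known)
        scores[lang] = (len(grams) - missing) - penalty * missing
    return scores

def detect(text, db, n=3, penalty=1.0):
    """Return the language with the highest score."""
    scores = score(text, db, n, penalty)
    return max(scores, key=scores.get)

# Hypothetical toy corpus, far too small for real use.
corpus = {
    "eng": ["hey how are you", "how are you doing today"],
    "fra": ["salut comment vas tu", "comment allez vous"],
}
db = build_db(corpus)
print(detect("how are you?", db))
```

With a small database the penalty mostly punishes missing coverage rather than a wrong language, so a lower `penalty` is probably safer until the per-language N-gram sets are close to complete.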
A regularized prior would help.
I’m also really interested in trying something like this: http://www.slideshare.net/shuyo/short-text-language-detectio... (slide 6). But I’d need a lot of training data, more than UDHR.