HN Mirror



	by wooorm 4318 days ago
	That’s because Haitians always say that! No, joking; it’s just that with so many supported languages, the accuracy for very short inputs is extremely low.

2 comments

allan_s 4318 days ago

For that, the approach I took in TatoDetect (which is meant specifically for detecting the language of one sentence at a time) is to keep a database of N-grams large enough that, for each language, you can be nearly sure it contains them all. Then, if the text you are checking contains an N-gram that a language does not have in the database, you can apply a score penalty to that language.
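
To make the idea concrete, here is a rough sketch of that kind of scoring. It is not TatoDetect's actual code: the function names, the trigram size, the penalty value, and the tiny two-language corpus are all assumptions made up for illustration. A real database would need far more text per language for the "nearly all N-grams" assumption to hold.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profiles(corpus, n=3):
    """Collect the set of n-grams seen for each language (hypothetical helper)."""
    profiles = defaultdict(set)
    for lang, sentences in corpus.items():
        for sentence in sentences:
            profiles[lang].update(ngrams(sentence, n))
    return dict(profiles)


def detect(text, profiles, n=3, penalty=1.0):
    """Score each language: +1 per known n-gram, -penalty per unknown one."""
    grams = ngrams(text, n)
    scores = {}
    for lang, known in profiles.items():
        score = 0.0
        for gram in grams:
            # An n-gram missing from a (supposedly exhaustive) profile
            # counts as evidence against that language.
            score += 1.0 if gram in known else -penalty
        scores[lang] = score
    return max(scores, key=scores.get), scores


# Toy data, far too small for real use; assumed for the example only.
TOY_CORPUS = {
    "eng": ["the cat is on the table", "this is a short sentence"],
    "fra": ["le chat est sur la table", "ceci est une phrase courte"],
}

if __name__ == "__main__":
    profiles = build_profiles(TOY_CORPUS)
    print(detect("the cat", profiles))
    print(detect("le chat", profiles))
```

The penalty only makes sense if the profiles really are close to exhaustive; with a small database, an unseen N-gram says more about the database than about the language.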


ppod 4318 days ago

A regularized prior would help.
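
One possible reading of this, sketched below: combine each language's likelihood score with a prior over languages, so that when a very short input leaves the likelihoods nearly tied, the more common language wins. The function name, the `strength` knob, and every number here are invented for the example; they do not come from any real detector or dataset.

```python
import math


def posterior(log_likelihoods, priors, strength=1.0):
    """Combine per-language log-likelihoods with a language prior.

    score(lang) = log p(text | lang) + strength * log p(lang)
    strength=0 ignores the prior; larger values lean on it more.
    """
    scores = {
        lang: ll + strength * math.log(priors[lang])
        for lang, ll in log_likelihoods.items()
    }
    # Normalize with the max subtracted for numerical stability.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {lang: math.exp(s - top) / total for lang, s in scores.items()}


if __name__ == "__main__":
    # Made-up numbers: a short input where Haitian Creole ("hat")
    # scores marginally better than English ("eng").
    log_likelihoods = {"hat": -5.0, "eng": -5.2}
    priors = {"hat": 0.01, "eng": 0.99}
    print(posterior(log_likelihoods, priors, strength=0.0))
    print(posterior(log_likelihoods, priors, strength=1.0))
```

With longer inputs the likelihood gap grows and the prior matters less, which is the behaviour you would want here.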


wooorm 4318 days ago

I’m also really interested in trying something like this: http://www.slideshare.net/shuyo/short-text-language-detectio... (slide 6). But I’d need a lot of training data, more than UDHR.
