| HN Mirror



	by tgv 1589 days ago
	I see Dutch performs badly. I wouldn't be surprised if that's because of bad/noisy training data. Dutch web content contains an awful lot of English, which pollutes recognition. To be sure, cross-check the Dutch tokens against an English dictionary (although there is quite some overlap among frequent words, e.g. "is", "we", "are", "have", "bent", "had", "brief", and among rare ones like "keeshond"). BTW, the test statistic for recognizing individual words isn't interesting unless you sample or weight by word frequency.
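To make both suggestions concrete, here is a minimal sketch. The function names, the toy word lists, and the counts are all made up for illustration and are not taken from the library under discussion. It flags tokens that also appear in an English word list, and it compares a per-word accuracy with one weighted by how often each word occurs.

```python
def split_by_english_overlap(tokens, english_words):
    """Split tokens into (not in the English list, also in the English list)."""
    english = {w.lower() for w in english_words}
    kept, shared = [], []
    for tok in tokens:
        (shared if tok.lower() in english else kept).append(tok)
    return kept, shared


def accuracy(per_word_correct, word_freq=None):
    """Accuracy over words; weighted by word_freq counts when they are given.

    per_word_correct maps word -> bool (was the language detected correctly?).
    word_freq maps word -> corpus count; words missing from it get weight 0.
    """
    if word_freq is None:
        weights = {w: 1 for w in per_word_correct}
    else:
        weights = {w: word_freq.get(w, 0) for w in per_word_correct}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hits = sum(weights[w] for w, ok in per_word_correct.items() if ok)
    return hits / total


# Hypothetical toy data: one very frequent ambiguous word, three rarer ones.
correct = {"is": False, "fiets": True, "keeshond": True, "gezellig": True}
freq = {"is": 900, "fiets": 60, "keeshond": 1, "gezellig": 39}

print(accuracy(correct))        # 0.75, every word counts once
print(accuracy(correct, freq))  # 0.1, dominated by the frequent word "is"
```

With these invented numbers the two statistics disagree sharply, which is the reason a per-word figure says little unless it is weighted by frequency.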

1 comments

yorwba 1588 days ago

My guess is that it has trouble distinguishing between Afrikaans and Dutch, Indonesian and Malay, and other similar pairs.


tgv 1588 days ago

Overlap with those is not particularly large. n-grams are particularly sensitive to spelling, and e.g. Afrikaans writes "Hy het skool toe gegaan", whereas it would be "Hij is naar school gegaan" in Dutch.
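A rough way to see the spelling sensitivity is to compare character trigrams of the two example sentences. This is only a sketch: the helper names are made up, the padding with spaces is an assumption, and a real detector would use trained frequency profiles rather than a set overlap on one sentence.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def jaccard(a, b):
    """Share of distinct n-grams that two profiles have in common."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


afrikaans = char_ngrams("Hy het skool toe gegaan")
dutch = char_ngrams("Hij is naar school gegaan")

# Trigrams only one of the two sentences has, e.g. "sko" vs. "sch".
print(sorted(set(afrikaans) - set(dutch)))
print(sorted(set(dutch) - set(afrikaans)))
print(jaccard(afrikaans, dutch))
```

Only the shared word "gegaan" and the ending of "skool"/"school" contribute common trigrams here; the rest of each sentence is distinctive.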

Indonesian and Malay insert vowels in consonant clusters and replace quite a few consonants, so they should be easily distinguishable from Dutch, even on Dutch loan words (which are not that frequent anyway).

Dutch has a much larger overlap with German (probably the largest), but even those can be distinguished (by a human) with just a few words of a meaningful sentence. I find it difficult to come up with three words that could be a grammatical fragment in both languages, but even then I expect the n-gram frequencies to diverge considerably.


yorwba 1588 days ago

It's in single-word detection that the accuracy for Afrikaans and Dutch is between 50% and 60%. Understandable, considering "het" and "toe" are also Dutch words and "is" is also Afrikaans.
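Shared words put a ceiling on single-word accuracy, which a small sketch can illustrate. It assumes test words are drawn in proportion to their counts and that the best a detector can do for a shared word is to pick the language where it is more frequent. The function name and all counts below are invented for illustration.

```python
def single_word_ceiling(freq_a, freq_b):
    """Best achievable accuracy when guessing the language from one word alone.

    freq_a and freq_b map word -> count in each language. For every word the
    ideal guesser picks the language with the higher count, so it gets
    max(count_a, count_b) of that word's occurrences right.
    """
    total = sum(freq_a.values()) + sum(freq_b.values())
    if total == 0:
        return 0.0
    words = set(freq_a) | set(freq_b)
    best = sum(max(freq_a.get(w, 0), freq_b.get(w, 0)) for w in words)
    return best / total


# Hypothetical counts: "het", "toe" and "is" occur in both languages.
afrikaans_freq = {"het": 50, "toe": 20, "is": 30, "hy": 10}
dutch_freq = {"het": 60, "toe": 10, "is": 40, "hij": 10}

print(single_word_ceiling(afrikaans_freq, dutch_freq))  # 140 / 230, about 0.61
```

With vocabularies this entangled, even an ideal guesser lands not far above 50%, so figures in that range for single words are not surprising.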

I meant that Indonesian and Malay would be difficult to distinguish from each other.
