by tgv
1589 days ago
I see Dutch performs badly. I wouldn't be surprised if that's because of bad/noisy training data. Dutch web content contains an awful lot of English, which pollutes recognition. Cross-check the Dutch tokens against an English dictionary to be sure (although there is quite a bit of overlap for frequent words, e.g. "is", "we", "are", "have", "bent", "had", "brief", etc., and for rare ones like "keeshond"). BTW, the test statistic for recognizing individual words isn't interesting unless you sample or weight by word frequency.
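To make both suggestions concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function names, the tiny word lists, and the made-up frequencies are mine, not taken from the project under discussion. It splits Dutch training tokens by whether they also appear in an English word list, and it computes accuracy weighted by word frequency rather than per distinct word.

```python
def split_by_english_overlap(dutch_tokens, english_words):
    """Split tokens into those also present in an English word list and the rest.

    Overlapping tokens are only candidates for English pollution: genuine
    Dutch words such as "is" or "brief" end up in the overlap list too.
    """
    english = {w.lower() for w in english_words}
    overlap = [t for t in dutch_tokens if t.lower() in english]
    dutch_only = [t for t in dutch_tokens if t.lower() not in english]
    return overlap, dutch_only


def weighted_accuracy(results, frequencies):
    """Accuracy where each word counts in proportion to its corpus frequency.

    results:     dict mapping word -> True if it was recognized correctly
    frequencies: dict mapping word -> occurrence count (hypothetical data)
    Words missing from `frequencies` get weight 0.
    """
    total = sum(frequencies.get(w, 0) for w in results)
    if total == 0:
        return 0.0
    correct = sum(frequencies.get(w, 0) for w, ok in results.items() if ok)
    return correct / total


# Illustrative data only.
tokens = ["is", "we", "fiets", "keeshond", "gezellig"]
english = ["is", "we", "are", "keeshond"]
overlap, dutch_only = split_by_english_overlap(tokens, english)

results = {"is": True, "fiets": False, "gezellig": True}
freqs = {"is": 90, "fiets": 5, "gezellig": 5}
unweighted = sum(results.values()) / len(results)
weighted = weighted_accuracy(results, freqs)
```

With these toy numbers the per-word accuracy is 2/3, while the frequency-weighted accuracy is 0.95, because the one miss is on a rare word. The overlap list still needs manual review, since it mixes real Dutch words with actual English leakage.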