|
|
|
|
|
by jelmervdl
1298 days ago
|
|
The datasets these models are trained on are sentence pairs. So even if just a couple of sentences between two wikipedia sites are direct translations of each other, they will have appeared in the training set. They don’t have to have appeared on the same topic page, it could be that English Wikipedia has a whole category for a topic while Estonian Wikipedia has just a long single page, direct translations will still be identified and used in training. I also think that the domain and the type of language used on Wikipedia is pretty consistent which will help a lot with unseen sentences. By no means are these models bad! It’s just that Wikipedia is a particularly easy test for them. |
|