| HN Mirror



	by vanderZwan 1298 days ago
	Honest question: there are plenty of articles on Wikipedia where the different language versions of a page are vastly different (it feels like the majority in my experience, but that's no proof of course). How would that be useful as training data unless it is heavily curated?


jelmervdl 1298 days ago

The datasets these models are trained on consist of sentence pairs. So even if just a couple of sentences in two Wikipedia language editions are direct translations of each other, they will have appeared in the training set. They don’t have to have appeared on the same topic page: English Wikipedia might have a whole category for a topic while Estonian Wikipedia has just a single long page, and direct translations will still be identified and used in training.

I also think that the domain and the type of language used on Wikipedia are pretty consistent, which will help a lot with unseen sentences.

By no means are these models bad! It’s just that Wikipedia is a particularly easy test for them.


kevincox 1298 days ago

How are these identified? Are they human curated? If not, it seems like you need a translator to decide whether two sentences are an equivalent pair before you can build your translator.


jelmervdl 1298 days ago

You're pretty much right on the money. For ParaCrawl[1] (which I worked on) we used fast machine translation systems that were "good enough" to translate one side of each candidate pair into the language of the other, checked whether the two matched sufficiently, and then dealt with all the false positives through various filtering methods. Other datasets I know of use multilingual sentence embeddings, like LASER[2], to compute the distance between two sentences.
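To make the translate-then-match idea concrete, here is a minimal sketch. It is not ParaCrawl's actual pipeline: `toy_translate` is a hypothetical word-for-word lookup standing in for a real fast MT system, and the Jaccard token overlap and the 0.6 threshold are assumptions chosen only for illustration.

```python
# Hypothetical stand-in for a fast, "good enough" MT system:
# a tiny word-for-word Dutch-to-English lookup table.
TOY_LEXICON = {
    "de": "the", "kat": "cat", "zit": "sits", "op": "on",
    "mat": "mat", "hond": "dog", "blaft": "barks",
}


def toy_translate(sentence):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_LEXICON.get(w, w) for w in sentence.lower().split())


def token_overlap(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_pairs(src_sentences, tgt_sentences, translate=toy_translate, threshold=0.6):
    """Pair each source sentence with its best-scoring target sentence.

    A pair is kept only if the translated source overlaps enough with the
    target. The threshold is an illustrative assumption, not a tuned value.
    """
    pairs = []
    for src in src_sentences:
        translated = translate(src)
        best = max(tgt_sentences,
                   key=lambda t: token_overlap(translated, t),
                   default=None)
        if best is None:
            continue
        score = token_overlap(translated, best)
        if score >= threshold:
            pairs.append((src, best, score))
    return pairs


if __name__ == "__main__":
    dutch = ["de kat zit op de mat", "de hond blaft", "de kat blaft"]
    english = ["the dog barks", "the cat sits on the mat", "something unrelated"]
    for src, tgt, score in match_pairs(dutch, english):
        print(f"{score:.2f}  {src!r} <-> {tgt!r}")
```

Note that the sentences do not need to come from the same page: every source sentence is compared against the whole pool of target sentences. A real system would need a much cheaper candidate search than this all-pairs loop, plus the false-positive filtering mentioned above.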

Both of these methods have a bootstrapping problem, but at this point in MT we have enough data for many languages to get started. Previous iterations of ParaCrawl used things like document structure and the overlap of named entities between sentences to identify matching pairs, but this is much less robust. I don't know how they solve this problem today for low-resource languages.
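As a rough illustration of the named-entity overlap heuristic, the sketch below scores two sentences by the entities they share. The entity detector here is a deliberately crude assumption (numbers, plus capitalized words that are not sentence-initial), not what ParaCrawl used; all function names are hypothetical.

```python
import re


def crude_entities(sentence):
    """Very rough 'named entity' guess: numbers and non-initial capitalized words.

    This is an illustrative assumption, not a real NER system.
    """
    tokens = re.findall(r"\w+", sentence)
    return {
        tok for i, tok in enumerate(tokens)
        if tok[0].isdigit() or (i > 0 and tok[0].isupper())
    }


def entity_overlap(a, b):
    """Jaccard similarity between the crude entity sets of two sentences."""
    ea, eb = crude_entities(a), crude_entities(b)
    if not ea or not eb:
        return 0.0
    return len(ea & eb) / len(ea | eb)


if __name__ == "__main__":
    en = "In 1991 Linus Torvalds released Linux in Helsinki."
    nl = "In 1991 bracht Linus Torvalds Linux uit in Helsinki."
    other = "Paris is the capital of France."
    print(entity_overlap(en, nl))     # shared names and year
    print(entity_overlap(en, other))  # no shared entities
```

The weakness is easy to see from the sketch: sentences without names or numbers score zero, and languages that transliterate names or capitalize differently break the matching, which fits the point about robustness.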

[1] https://paracrawl.eu

[2] https://github.com/yannvgn/laserembeddings
