| HN Mirror


by yorwba 1679 days ago

A big problem with this kind of massively multilingual machine learning research is that the researchers in question know almost nothing about most of the languages they're dealing with. For example, they grouped Malayalam, a Dravidian language, with Malay, an Austronesian one. (Though they also say that they focused on languages that get the most translation requests, so maybe this is down to users getting confused about which language they want.)

Their parallel sentence mining project LASER also has problems that are obvious when you know the languages involved. Some time ago I looked at their most confident matches for English-Chinese and briefly thought I was looking at the least confident ones, because Bible quotes were paired with random snippets in Classical Chinese. I think their embedding model was confused by the archaic language.

So I'm glad they also used human evaluators and not just BLEU scores, but I'd've really liked to see a human evaluation of their training data. I think it's possible that the model can average out noise to produce better garbage when you put garbage in, but it might also get completely confused and produce worse garbage. With their testing setup, it's impossible to tell whether more data or better data is needed to improve the performance of this model.

1 comment

mrbukkake 1679 days ago

Some of the assumptions about language in this paper are just total junk lol... this one is particularly good: "...and for the rest, overlapping vocabulary is a good proxy for similar languages". This is so wrong I don't even know where to start. The grouping of languages by family is also bizarre: the genetic groupings they give for each language are at all sorts of different levels of the family tree. They say that cultural and geographic proximity was also a factor in grouping, but e.g. the Mongolic and Kra-Dai families have essentially nothing in common apart from the fact that the people who speak them look sort of similar to a European. Grouping the Afroasiatic languages Somali and Amharic with the Niger-Congo set also makes it seem like the only criterion was the physical appearance of the speakers...

There is also no way for a reader of the paper to judge the effectiveness of the algorithm. They cite this evaluation of "semantic accuracy", but say nothing about the design of the task, participant selection, or example data.

This paper is pretty much junk science. Even the reference section is amateurishly formatted.