by resonance1994 1677 days ago
Tamil is not an Indo-Aryan language though? It's a Dravidian language

A big problem with this kind of massively multilingual machine learning research is that the researchers in question know almost nothing about most of the languages they're dealing with. They also grouped Malayalam with Malay. (Though they also say that they focused on languages that get the most translation requests, so maybe this is down to users getting confused about which language they want.)

Their parallel sentence mining project LASER also has problems that are obvious when you know the languages involved. Some time ago I looked at their most confident matches for English-Chinese and briefly thought I was looking at the least confident ones, because Bible quotes were paired with random snippets in Classical Chinese. I think their embedding model was confused by the archaic language.

So I'm glad they also used human evaluators and not just BLEU scores, but I'd've really liked to see a human evaluation of their training data. I think it's possible that the model can average out noise to produce better garbage when you put garbage in, but it might also get completely confused and produce worse garbage. With their testing setup, it's impossible to tell whether more data or better data is needed to improve the performance of this model.

Some of the assumptions about language in this paper are just total junk lol... This one is particularly good: "...and for the rest, overlapping vocabulary is a good proxy for similar languages". This is so wrong I don't even know where to start. The grouping of languages by family is also bizarre: the genetic groupings they give for each language are at all sorts of different levels. They say that cultural and geographic proximity was also a factor in grouping, but e.g. the Mongolic and Kra-Dai families have essentially nothing in common apart from the fact that the people who speak them look sort of similar to a European. Grouping the Afroasiatic languages Somali and Amharic with the Niger-Congo set also makes it seem like the only criterion was the physical appearance of the speakers...
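
To make the "overlapping vocabulary" proxy concrete, here is a minimal sketch of what such a measure might look like, assuming it is something like Jaccard similarity over token sets. The paper's actual computation is not described in this thread, so the function name `vocab_overlap` and the placeholder tokens are hypothetical. The point is only that the score counts shared surface tokens, so loanwords or a shared script between unrelated languages would raise it just as common ancestry would.

```python
def vocab_overlap(vocab_a: set, vocab_b: set) -> float:
    """Jaccard similarity between two vocabularies, in the range 0.0 to 1.0.

    Hypothetical stand-in for a "vocabulary overlap" proxy; not the
    paper's actual method.
    """
    union = vocab_a | vocab_b
    if not union:
        # Two empty vocabularies share nothing measurable.
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


# Abstract placeholder tokens, not real language data.
lang_x = {"a", "b", "c", "d", "e"}
lang_y = {"a", "b", "c", "x", "y"}

# 3 shared tokens out of 7 distinct tokens overall.
score = vocab_overlap(lang_x, lang_y)
print(f"overlap = {score:.3f}")
```

Nothing in this score distinguishes inherited vocabulary from borrowed vocabulary, which is why it says little about genetic relatedness on its own.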

There is also no way for a reader of the paper to judge the effectiveness of the algorithm. They cite this evaluation of "semantic accuracy", but say nothing about the design of the task, participant selection, or example data.

This paper is pretty much junk science. Even the reference section is amateurishly formatted.

Haha, nice catch :)

I don't know anything about the situation there, but it might still make sense to group it if it's in the same "linguistic area" (see https://en.wikipedia.org/wiki/Sprachbund ). E.g. the Apertium translator from Northern Saami to Norwegian is very useful since both languages – though from very different families – are spoken in the same country and speakers have had millennia of contact, so there's more translated text available than you'd otherwise expect from such different languages and there's need for more translations.