	by btheshoe 1445 days ago
	I'm not entirely sure why low resource languages are seen as such a high priority for AI research. It seems that by definition there's little payoff to solving translation for them.

11 comments

albertzeyer 1445 days ago

I don't really remember the exact numbers anymore, but covering only the top 5 languages will cover maybe 40% of the world population, while covering the top 200 languages (many of them low resource) will cover maybe 90% of the world population.

Some numbers (though you cannot directly infer such cumulative figures from them): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...

Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...

"96% of the world’s languages are spoken by only 4% of its people."

Although this statement is more about the long tail of the approximately 7000 languages.


jefftk 1445 days ago

It doesn't sound like you're considering that people are very often fluent in a major language in addition to their regional one?


albertzeyer 1445 days ago

I am. That's why I mentioned that you cannot infer my statements directly from the numbers you find on Wikipedia etc. You cannot simply add up those numbers.


goodside 1445 days ago

"Low-resource language" isn't just a euphemism for "language almost nobody speaks". There are many languages that are widely spoken but nonetheless are hard to obtain training data for. Getting something like Wikipedia going for a minority language can be a difficult chicken-and-egg problem because users will use English for its completeness/recency, despite their limited fluency, and the native-language Wikipedia remains neglected. So you can end up in a situation where users use one language for social media and another for news/research, and Facebook is in a unique position to care about the former.


cyphar 1445 days ago

Aside from the fact that being able to generalise a model with very little training data is an important AI research problem to solve, language death is a serious concern and is being accelerated due to the fact that many languages are not supported at all by modern technology (leading to "prestige language" pressures that are a known cause of historical language death).

For instance, Icelandic is not supported by any modern smartphone platform, which has led to Icelandic natives communicating with each other in English, and very little information is translated into Icelandic[1,2].

That being said, I am worried that having translations that are "too good" could also act to accelerate language death as the importance of keeping languages alive will seem less significant (to non-language-nerds) if we can translate works written in that language to any other language with very small datasets. Luckily I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time -- languages are so different in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.

[1]: https://youtu.be/qYlmFfsyLMo?t=141
[2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...


wilde 1445 days ago

The point is that there are lots of humans who speak these languages and use tech. They just don’t use Wikipedia so getting a good translation corpus going was harder.


gwern 1445 days ago

And it's cumulative across all those languages (see above), cheap/amortized (if you can do a good multilingual NMT for 50 languages, how hard can 50+1 languages be?), and many of those languages are likely to grow both in terms of sheer population and in GDP. (Think about Southeast Asian or African countries like Indonesia or Nigeria.) The question isn't why FB & Google are investing so much in powerful multilingual models which handle hundreds of languages, but why other entities aren't as well.


ausbah 1445 days ago

what other entities would really have access to the text resources that FB & Google have? outside of a few other large companies I can't imagine many


Jabbles 1445 days ago

Surely the fact that they did all the high-resource languages first and are only now getting round to the less-popular ones demonstrates that that is not, in fact, the case?


tehsauce 1445 days ago

I think the reason low resource languages are prioritized is to compensate for the fact that AI research normally has a tendency to marginalize these languages.


btheshoe 1445 days ago

yes, but what principles justify the importance placed on low resource languages?


froskur 1445 days ago

Low resource in this context means that there are few resources available to train a neural network with, not that there are few speakers. Although many low resource languages have relatively few speakers, there are also ones with tens of millions of speakers.

The reason for emphasis is in my opinion twofold: 1) Allowing these people to use the fancy language technology in their own language is good in and of itself. 2) Training neural networks on fewer resources is more difficult than using more resources and therefore a fun and interesting challenge.


macintux 1445 days ago

Plus presumably we learn more from solving harder problems, and we prepare for one day needing to translate some alien language in a hurry.


quink 1445 days ago

The examples given are, with native speaker numbers, Assamese (15 million), Catalan (4 million) and Kinyarwanda (10 million). These alone are more than an Australia.

Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).

I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.

That's the payoff.


jw4ng 1445 days ago

We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many languages as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translation for all languages.


daniel-cussen 1445 days ago

Wouldn't that also entail a bot speaking in any language?


bobsmooth 1445 days ago

Text to speech is a separate problem.


dunefox 1445 days ago

Small data, big meaning is much more important than big data, little meaning. Much closer to real intelligence.


munificent 1445 days ago

Cynical answer: It's good PR.


onurcel 1445 days ago

hi @btheshoe, I work on the data side of this project. As others mentioned, the amount of data available for a language is not correlated with the number of speakers of that language, which explains the potential impact of focusing on these languages.
