I'm not entirely sure why low-resource languages are seen as such a high priority for AI research. It seems that, by definition, there's little payoff to solving translation for them.
I don't really remember the exact numbers anymore, but covering only the top 5 languages will cover maybe 40% of the world population, while covering the top 200 languages (many of them low resource) will cover maybe 90% of the world population.
I am. That's why I mentioned that you cannot infer my statements directly from the numbers you find on Wikipedia etc. You cannot simply add up those numbers.
"Low-resource language" isn't just a euphemism for "language almost nobody speaks". There are many languages that are widely spoken but nonetheless are hard to obtain training data for. Getting something like Wikipedia going for a minority language can be a difficult chicken-and-egg problem because users will use English for its completeness/recency, despite their limited fluency, and the native-language Wikipedia remains neglected. So you can end up in a situation where users use one language for social media and another for news/research, and Facebook is in a unique position to care about the former.
Aside from the fact that being able to generalise a model with very little training data is an important AI research problem to solve, language death is a serious concern and is being accelerated due to the fact that many languages are not supported at all by modern technology (leading to "prestige language" pressures that are a known cause of historical language death).
For instance, Icelandic is not supported by any modern smartphone platform, which has led to native Icelandic speakers communicating with each other in English, and very little information is translated into Icelandic[1,2].
That being said, I am worried that having translations that are "too good" could also act to accelerate language death as the importance of keeping languages alive will seem less significant (to non-language-nerds) if we can translate works written in that language to any other language with very small datasets. Luckily I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time -- languages are so different in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.
The point is that there are lots of humans who speak these languages and use tech. They just don't use Wikipedia, so getting a good translation corpus going was harder.
And the payoff is cumulative across all those languages (see above) and cheap/amortized (if you can do a good multilingual NMT for 50 languages, how hard can 50+1 languages be?), and many of those languages are likely to grow both in terms of sheer population and in GDP. (Think about Asian or African countries like Indonesia or Nigeria.) The question isn't why FB & Google are investing so much in powerful multilingual models which handle hundreds of languages, but why other entities aren't as well.
Surely the fact that they did all the high-resource languages first and are only now getting round to the less-popular ones demonstrates that that is not, in fact, the case?
I think the reason low resource languages are prioritized is to compensate for the fact that AI research normally has a tendency to marginalize these languages.
"Low-resource" in this context means that there are few resources available to train a neural network with, not that there are few speakers. Although many low-resource languages have relatively few speakers, there are also ones with tens of millions of speakers.
The reason for emphasis is in my opinion twofold: 1) Allowing these people to use the fancy language technology in their own language is good in and of itself. 2) Training neural networks on fewer resources is more difficult than using more resources and therefore a fun and interesting challenge.
The examples given are, with native speaker numbers, Assamese (15 million), Catalan (4 million) and Kinyarwanda (10 million). Together, these three alone have more native speakers than Australia has people.
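For what it's worth, that sum is easy to check. A quick sketch in Python, taking the speaker figures quoted above as-is (not independently verified) and assuming roughly 26 million for Australia's population, which is my own approximate figure and not from the thread:

```python
# Native-speaker figures exactly as quoted in the comment above
# (taken at face value, not independently verified).
native_speakers = {
    "Assamese": 15_000_000,
    "Catalan": 4_000_000,
    "Kinyarwanda": 10_000_000,
}

# Assumption: Australia's population is roughly 26 million.
# This is an approximate figure and does not come from the thread.
AUSTRALIA_POPULATION_APPROX = 26_000_000

total = sum(native_speakers.values())
exceeds_australia = total > AUSTRALIA_POPULATION_APPROX

print(f"Combined native speakers: {total:,}")
print(f"More than Australia (approx.): {exceeds_australia}")
```

Note that this only adds native-speaker counts for three distinct languages. It says nothing about overlap from multilingual speakers, which is why cumulative coverage numbers can't be derived by simple addition in general.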
Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).
I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.
We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many languages as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translation for all languages.
Hi @btheshoe, I work on the data side of this project. As others mentioned, the amount of data available for a language is not correlated with the number of speakers of that language, which explains the potential impact of focusing on these languages.
Some numbers (though you cannot derive such cumulative figures directly from them): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...
"96% of the world’s languages are spoken by only 4% of its people."
That statement, though, is more about the long tail of the approximately 7,000 languages.