Hacker News
by ftyers 1775 days ago
One of the most noticeable additions in my opinion is Guarani, the first Indigenous language of the Americas to be added. Indigenous languages are extremely poorly supported and forgotten by all of the major platforms and companies, and it's great to see one getting the attention they deserve. (Disclaimer: I was involved)
2 comments

Whoa, 6.5 million native speakers! That's several orders of magnitude more than I was expecting. It's also significantly larger than the native-speaking populations of languages like Catalan, Basque, or Romansh, which might be more familiar to North Americans or Europeans.
There are a number of Native American languages that have numerous speakers, but until recently have been marginalized, repressed and ignored (and some to this day). Guarani is the most numerous, but also Quechua, Nahuatl, and the various Mayan languages (spoken by around half of Guatemalans, and another 2.5 million Mexicans).
I am shocked, shocked to learn that countries identifying as Latin America suppress non-Latin derived languages!
This is a very weird statement.

The problem is not the Spanish language. The problem is a colonial peasant economy/society that turned into a post-colonial peasant society, with land-owning quasi-nobility ruling over disempowered (in this case, indigenous) laborers and freely exercising their power to steal, rape, kill, etc., without penalty; it is a situation more or less comparable to peasant societies around the world and throughout history, which are always very exploitative and often racist.

Working class Spanish speakers living in towns were in many ways also economically exploited, but considered “better than indigenous people” to be a core part of their identities, and also felt free to beat them, steal from them, etc. where they found the opportunity. It’s a situation broadly comparable to race relations in the US south, where poor whites considered “better than blacks” to be a defining part of their identity.

Perhaps counterintuitively, the history of exploitation of indigenous communities, and the way indigenous people were shut out of many social and economic activities, led to the preservation of native languages.

Without wishing to get political, is the difference that Iceland is a country but Guarani speakers don't have a nation-state of their own? Or something else?
Note that Icelandic is currently not well supported either ("In progress" with 384/5000 sentences and 86% Localized). Actually, Guaraní is better supported at the moment, and quite a number of other common smaller-ish languages aren't well supported yet either, such as Hebrew, Danish, and even Korean (which is not small or even small-ish at all). Some other smaller languages are, such as Breton or Irish. Overall, it's a bit inconsistent. I suppose that this is because in the end, these things depend on the number of people contributing; there's a reason Esperanto is near the top, as it has a very active community of enthusiasts who love to promote the language.
It takes about a week to get the interface translated and to start collection, for any language with at least 5000 sentences in the public domain. I helped bootstrap Guarani and Breton and a few other languages spoken by friends of mine, but in the end, it just takes one or two people. I think in general there is a big difference in engagement if STT/ASR already exists for the language (e.g. Hebrew, Danish and Korean) and if it doesn't exist at all.
It's an official language of Paraguay
In case anyone else wanted to know more, there are, apparently, 2 official languages and the other is Spanish. https://www.servat.unibe.ch/icl/pa00000_.html#A140_
The difference is completely and inherently political.
I think this is overly dismissive of other factors. Whether or not a language is supported by something on the Internet has a lot more to do with financial incentives than politics. If there were a huge consumer market clamoring to give their money to a site and the only barrier were language, it'd get exploited pretty quickly.
This is superficially correct, but also completely disingenuous.

The reason there isn't a huge consumer market for indigenous languages is that they're overwhelmingly and systematically unsupported by their respective governments in favor of the non-indigenous colonial languages.

To be clear, that's not Mozilla's fault, and not something they or other random organizations can fix, but as human beings we should all be happy and give credit to those organizations that do their small part.

No, it has a lot to do with politics as well. A sovereign nation may find it important to have its languages supported widely on the internet, so it might put some public funds into translation efforts and voice recognition/speech synthesizer contributions.

I know the Icelandic government spends some money on this and it shows. This tiny language has way more support than other, far more widely spoken languages. If the Norwegian government wanted, I bet the Sámi languages could have support just as good as Icelandic's. Or if the Greenlandic government had more funds available, I bet we would see Kalaallisut in more places online.

Like any feature, perhaps it has to do with the volume of anticipated use vs the effort to support.
Nation-states are political entities, so choosing languages by such a distinction would absolutely be political.
I'm sure having a nation-state is a major factor, but I bet it also has to do with the average wealth, geographic location, historical alliances. However, I'd put my money on skin color as the biggest factor.
As an example in favor of your conclusion, I propose Greenlandic. It is geographically really close to Iceland, is the sole official language of an autonomous country, and has significant cultural heritage (with even a famous [possible] dwarf planet named after one of their historic gods). However, unlike Iceland, Greenland is not a wealthy country, and Greenlanders tend to have darker skin than Icelanders.
Autonomous territory, not a country.
>It is one of the official languages of Paraguay (along with Spanish), where it is spoken by the majority of the population, and where half of the rural population is monolingual.

Wow, I had no idea

Catalan has about 10 million speakers.
In total, yes, but only about 4 million _native_ speakers.
As an Icelander I am always really impressed with how well my language—a language spoken by a few hundred thousand people worldwide—is supported on various platforms and technologies. This is probably in no small part thanks to active participation by native speakers and even some government funding.

However, at the same time I’m also deeply disappointed by the lack of support for Iceland’s closest neighbour’s language, Greenlandic, which is an indigenous language and the sole official language of an autonomous country.

I saw the same when I was younger for Norwegian. Bokmål is the most commonly written form of Norwegian, but New Norwegian is used by about 15%. Most software included Bokmål support, but you could bet some hardcore user of New Norwegian had made a language pack available as well.
Ah, I remember "Nynorsk" (sorry for the bad spelling and ASCIIation) localisation of GNOME from early 2000s!

Generally, it takes only a few dedicated people to get software localised if good enough infrastructure is provided by the community!

I hope that's what we see with Mozilla Common Voice too!

"Nynorsk" is correct, no non-ASCII shenanigans in that word :)
For Mozilla Common Voice, it looks like even Bokmål isn't listed as a dataset yet. Language packs have the advantage that a single dedicated user can come up with the entire thing, but for voice collections you need a large variety of different people and ideally tons of them. For any language with a small native speaker population, even in a rich country like Norway and especially for a fractional subset like Nynorsk, getting enough speakers to participate in open source collection efforts will remain a challenge. Purportedly, even commercial companies find it hard to get enough Norwegians willing to speak a few sentences for a nominal payment, unlike in most other countries.

Luckily, speech recognition research is making some good progress on dealing with low-resource languages so hopefully we'll see some acceptable models made from the little available open data that's out there.

> However, at the same time I’m also deeply disappointed by the lack of support for Iceland’s closest neighbour’s language, Greenlandic, which is an indigenous language and the sole official language of an autonomous country.

I'm not sure "autonomous country" is an accurate description of what Greenland is. It is - for all intents and purposes - a devolved region of Denmark. It is still way too reliant on economic aid to be able to be independent and, honestly, probably couldn't exist as a developed nation without a patron (Denmark) or without selling its land/resources to a great power (USA, China). And the population is only 1/6 the size of Iceland's and is very dispersed on a massive arctic island, with most people living in tiny isolated villages by the coast.

With that in mind, you wouldn't expect great language support unless the Danish state steps in and spends some serious dough on it. I actually work on Danish language technology at the University of Copenhagen and let me tell you something... the Danish state hardly spends any money on Danish language resources either. We envy the kind of funding that researchers in countries like Iceland and Norway have access to.

> the Danish state hardly spends any money on Danish language resources either.

I’m actually a little disappointed that there is not more collaboration between the language departments in Iceland and Greenland. Iceland does spend some money on foreign languages and there is much interest in general for foreign languages in Iceland. The former president Vigdís Finnbogadóttir is a huge language buff and advocates for foreign languages a lot. So much so that the house of foreign languages at the University is named after her (https://vigdis.hi.is/).

It is generally believed in Iceland that setting up Icelandic cultural institutions in Reykjavík, such as the University, libraries and the National Theater, played a big part in our independence. There is also a lot of interest in Greenlandic independence in Iceland. Therefore it would make sense for a rich country like Iceland to spend some money on advancing the status of Kalaallisut, both in Iceland (through shared cultural events), in Greenland (by helping fund cultural institutions) and internationally (by helping fund online language efforts).

I’m writing this as a separate comment since it is an aside (i.e. not about investments in progressing indigenous languages online).

I don’t think it is wrong to call Greenland a country. As mentioned elsewhere, the word country is not strictly defined. Sometimes it means strictly independent nations, but most of the time it doesn’t. E.g. here is the CIA calling Greenland a country (https://www.cia.gov/the-world-factbook/countries/greenland/).