by Tabular-Iceberg 1445 days ago
My concern with this is that, in low-resource languages, the unavoidable biases of the ML models might overpower the languages' own organic development.

We shrug off all the little quirks of machine translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contains those quirks, I fear those will quickly become the standard way of writing and even speaking in those languages.

4 comments

This happens without machine translation in the wild already with pidgin. If you want to see real-life pidgin in action, watch Korean and English gamers interact in FPS games. This has been common at the borders of cultures where two languages interact.

Point being, I'm not sure if language purity is more valuable than functionally allowing its people to interact with things they couldn't otherwise. Put another way, should we leave these people locked out of many online resources they can't read because we fear corrupting their language? Give these people the option and let them decide. Language evolves over time anyway.

People present these as the choice between 0 (“locked out”) and 1.

In real world instances (the proverbial 80%), it’s more often transforming a 0.4 (“don’t know much English”) into a 0.7. And the people who get away with near 0 knowledge will usually have no critical need for translation, or will have access to other means (an actual translator, social help etc.) when really needed.

My mental image is grandmas reading online news, for whom machine translation would be a blessing and a curse. Or kids in the lower grades looking for help on a topic; I’d wish they got more time with the original text, to at least learn somewhat, rather than only getting a rough translation full of errors.

For interpersonal communication, people adjust; that’s what has been happening for centuries now.

> This happens without machine translation in the wild already with pidgin.

I said nothing about purity, I said organic evolution, which this is an example of. If the actual speakers want to develop a pidgin, fine, I just think it should be a decision made by people and not models.

In a worst case you can end up with the Scots Wikipedia situation, where some power editor created a bunch of pages using an entirely fabricated, overly stereotypical language and that influenced what people thought Scots actually was.
This is one of the examples we keep in mind and that's also why we can't 100% trust public dataset labels. This motivated us to train a Language IDentification system for all the languages we wanted to handle in order to build the monolingual dataset. More details in the paper ;) Or here, if you have questions
I think it will be interesting when it runs into a language (e.g. Dakota) where the women and men speak differently. Should be an interesting test.
Doesn't seem to be a big issue for Arabic, where verbs are gendered (so in the sentence "I am going to the store", the verb "to go" will be either masculine or feminine, reflecting the speaker's gender).
> so in the sentence "I am going to the store", the verb "to go" will be either masculine or feminine, reflecting the speaker's gender

But there the rules are the same for everyone. This is not true in general; there are languages where men and women speak according to different rules.

Here's a selection from Empires of the Word:

> These works [written by women] are usually written in Emesal, 'the fine tongue', a separate dialect of Sumerian, well documented in scribal dictionaries. In dialogue works this dialect is used for the speech of goddesses. It differs from standard Sumerian, Emegir, 'the princely tongue', both in vocabulary (including the names of many gods) and also in pronunciation (consonants by and large being articulated farther forward in the mouth); it differs not at all in its grammar. For example, when the goddess Inanna is affecting to repel the advances of an importunate suitor, she cries:

> kuli Mulila šu bamu emeše daŋen amaŋu lulaše ta munaben amaŋu Gašangale lulaše ta munaben

> Friend of Enlil, let me free! Let me go to my house! What lie shall I tell my mother? What lie shall I tell my mother Ningal?

> Both Enlil and Ningal are, of course, gods. In Emegir this would have been (with the differences highlighted):

> kuli Enlila šu bamu eŋuše gaŋen amaŋu lulaše ana munaben amaŋu Ningale lulaše ana munaben

Arabic is the 5th or 6th most spoken language. I think the concern for low resource languages is that nuances like that won't get picked up.
That's fair, I was mostly just responding to the parent comment's point about language models running into potential difficulties in languages where the men and women speak differently (though I don't speak Dakota, so the gender-specific differences there may be more pronounced than in Arabic, where there's also the "default"/neutral option of just picking the masculine version of verbs unless you know the subject(s) are female).
That's not what I meant. It isn't the words that are gendered, but the way the speaker talks that is gendered. My old boss was taught to speak by her uncles. Her female relatives teased her since she talked like a man.
Won't people trying to learn a low resource language as a second language also bring their influence?