Hacker News
by luke_s 3496 days ago
I tried with some Facebook comments and articles posted to Facebook in Chinese, and the results were pretty much incomprehensible as well. It was so bad that it makes me wonder: has Google actually enabled this for all users?

Their training dataset is almost certainly biased towards 'formal' Chinese sources, e.g. newspapers, news broadcasts, and so on. This is probably true for every language translation dataset, but at least anecdotally I can confirm the massive disconnect between spoken and written Chinese.

It's really interesting culturally, since modern written Chinese is split between Simplified (PRC) and Traditional (HK/TW/etc), because Mao thought Traditional was too difficult for the proletariat. Yet official national news sources in China are almost always given in formal Chinese, which nobody outside of the elite really speaks!

It's not a difference between "elite" and "non-elite" language. It's the difference between written and spoken language.

Go to any USA Today or WSJ article and read a paragraph out loud; no one talks like that.

This effect is also extremely noticeable in the Finnish language. The rules of Finnish grammar are followed much more strictly in writing of any kind than they are in speech. There are rules of grammar that are always followed when writing but are not really that important when speaking.

As an example, take the sentence "kirja on työpöydälläni", which means "the book is on my desk". The word "työpöytä" (desk) gets two suffixes: "-llä", which corresponds to the preposition "on", and "-ni", which is the first-person genitive. But when speaking, this would easily come out as "kirja on minun työpöydällä" instead, where the noun isn't in the genitive form at all anymore; the genitive has become a separate word, a pronoun in the genitive ("minun").

If you study just the grammatical rules and nothing else, you might think that the second sentence is obviously grammatically wrong. (Because according to the rule, the noun must change its case to correspond to the genitive.) Yet it's completely acceptable to say it aloud that way, even in a formal context, and nobody would bat an eye. While at the same time if you put it this way in any kind of writing, you would almost surely be notified by the grammar police that you have made a grave mistake.

I find this duality of language fascinating. And this will certainly continue producing problems for the field of machine translation. Google Translate is infamous in Finland for being near-useless for translating anything to or from Finnish.

The GP was about Weibo messages/posts, and how those written messages reflect colloquial or spoken language much more closely than something from Sina.
Written text in SMS and Twitter tends to be a lot closer to the way people speak, whereas newspapers are a lot more "polished". So there is definitely a bias depending on what you train your algorithm on. Also, topics and vocabulary may differ widely, so if you train on newspapers, your algorithm will struggle with phone conversations.

If you are talking about Modern Spoken Mandarin (or written Spoken Mandarin: SMS, social media) vs Modern Written Mandarin, I don't think the gap is that large compared to other languages. Certainly a lot less than the gap between written Colloquial English and Formal English (more words of Latin origin).

Looking at the People's Daily website (which is presumably an official news source in China), it looks like standard newspaper Chinese. Should be readable for most Chinese people with at least primary education.

"I don't think the gap is that large compared to other languages"

As someone learning Chinese, I can sympathize with Google Translate. Spoken Mandarin doesn't give you nearly as much context as more modern written Mandarin. I have no problem reading a newspaper but real conversation between Chinese people is just lost on me. It's not just a pace of listening thing, there is just too much of the sentence that isn't said out loud.

I tried the same with Portuguese comments/text, where I always felt Facebook's translation was pretty bad, and in Google Translate it is a lot better, to the point that you hardly notice it's translated by an algorithm at all.
Actually I still think Google is pretty bad at going between all Portuguese variants.

It puts them all in the same bag, resulting in very strange translations when Portuguese is the target language.

Going the other way around, I have yet to see any of the variants translated in a way that keeps the sense of all verbs and articles across languages.

For example, translating você to either Du or Sie in German, depending on the Portuguese variant being used.