by allpratik 3029 days ago
Seriously, I am trying my luck in NLP with some Indian languages like Marathi, Hindi, Gujarati and Tamil. The sheer number of dialects is driving me nuts.

Even dialects have very different-sounding words or pronunciations, which increases the complexity exponentially. But I am trying it only with an eye on its potential. NLP will simply act as the catalyst for technology adoption in rural and semi-rural areas.

I'm thinking of studying China's approach here, since I've heard that Chinese is also incredibly diverse.

China does have very diverse dialects. But Mandarin is taught in all schools in mainland China and we treat it as the standard. It is also the common way people from different areas communicate.

For the younger generation this is not a problem. Back in my college days, people in the same class came from different areas, from Xinjiang to Canton to the northeast corner. We had no problem understanding each other, though it was sometimes funny because of unique words or accents.

It seems that the problem is not as severe in China as in India.

And what are called "dialects" in the Chinese context are just as much dialects as French, Italian, and Romanian are "dialects" of Romance.
These mostly get called “dialects” for political reasons: the Chinese government doesn’t want to acknowledge that they are separate mutually unintelligible languages. Media in Chinese languages other than Mandarin is restricted, children are forced to use Mandarin in school, all official business is done in Mandarin, etc. There is a concerted effort to make other languages economically unviable, and generally to disempower and discourage regional / minority cultures. The grandparent poster’s experience is evidence that this strategy is working out.

It is similar to the way the Chinese government assigns non-native political officials to rule each region, and severely censors any politically controversial communication/media. That is, it is yet another tool of authoritarian social control, an effort to forestall any political opposition to the central government and its unresponsive top-down decision-making process.

India does not have the same kind of authoritarian governing institutions, so similar forced homogenization would not be politically viable.

> the Chinese government doesn’t want to acknowledge that they are separate mutually unintelligible languages.

Except they are not mutually unintelligible. Put someone from Heilongjiang province in Sichuan and they will still be able to understand the language, albeit with more difficulty.

Though there are dialects that do have completely different pronunciation, they all use the same underlying script, save a select few minority languages. Mandarin Chinese is taught in school, but everyone still uses the local dialect to speak with each other.

I'm not even denying the CCP has ulterior motives in doing this, but your original claim was simply incorrect and disingenuous.

Heilongjiang is part of the Northeast, aka former Manchuria, and was settled in the mid to late 1800s from the North China Plain. The entire North China Plain speaks variants of Mandarin for the same reason North American English is far less diverse than British and Irish English: there was a relatively small, recent founder population.

Wu (Shanghainese and the other related dialects of the Yangtze river delta), Yue (Cantonese), Hakka, Xiang and Min are absolutely languages. They're at least as divergent as the Romance languages or the different "dialects" of Arabic. Having a single written standard does not make the spoken varieties one language. And even if it did, Cantonese has a written standard, even if it's not used much, so there are at least two Chinese languages.

https://en.wikipedia.org/wiki/List_of_varieties_of_Chinese
https://en.wikipedia.org/wiki/Written_Cantonese

I speak a branch of Wu myself, and it's ABSOLUTELY not a different language from other dialects of Chinese. There are small (though commonly used) parts of the dialects that are dramatically different from Mandarin, but most parts are still the same. Especially if you need to speak about things in a more formal context, or describe concepts that are more abstract, the dialects are no different from each other when written down.

Cantonese, Hokkien, and Mandarin are not at all mutually intelligible. Mandarin speakers can't even read Hong Kong newspapers fluently.

Only Mandarin and Cantonese even have a fully developed way of writing with characters. Up until relatively recently Mandarin itself was considered a spoken language, until a written standard (i.e. a correspondence of characters with the words people actually spoke) was developed. Hokkien is in the process of this now in Taiwan: they literally have a government department choosing characters for words. (They started off with 900 or something; not sure where they are up to now.)

> Only Mandarin and Cantonese even have a fully developed way of writing with characters. Up until relatively recently Mandarin itself was considered a spoken language

Native speaker here. I have absolutely no idea where you get that.

Cantonese is just one of many dialects, and in fact, it is not a single dialect: people from different parts of Guangdong province actually speak Cantonese very differently. Should you consider those different languages?

Cantonese, Hokkien and Mandarin do sound like different languages, but not all dialects do. Most Chinese speakers can understand the dialects spoken in the central and northern parts of China, even though they usually can't speak those dialects.

Even though some of the dialects sound very different, the words, syntax and sentences being used are actually the same. That's how people can read what speakers of other dialects write with no problem.

To complicate the issue even more, there are not one but two writing systems currently in use: Simplified Chinese is used in mainland China and Singapore, while Traditional Chinese is used in Hong Kong, Macao and Taiwan. That's the reason people from mainland China (no matter what dialect they speak, even Cantonese) cannot read Hong Kong newspapers fluently.

The two writing systems are different, but they have a one-to-one mapping for each character, so they are not two unrelated systems either.

Are you a native speaker? Because 廣東話 and 閩南話 are totally unintelligible to me, and I want your language superpower.

It's just typical Chinese nationalist mythology. 'We all speak the same language' is just another fairy tale they drill into their heads, along with '5000 years of culture'.

When you've talked to one of them you've talked to all of them.

I don't see any part of your comment being correct.

First of all, "Mandarin" is not only spoken in mainland China but is also the standard in Singapore and Taiwan. It was created by the Republic of China (which later became the government of Taiwan) back in 1923, long before the current Chinese government came into power.

Children use Mandarin in school because they have to learn it to be able to communicate with people from other parts of China; not being able to would be a huge disadvantage for them. (I can't imagine how I would have communicated with other people in college otherwise.) It doesn't mean people will forget how to speak their own dialect. In fact, people from the same region almost always speak their own dialects.

Regional / minority cultures are generally protected by the government. Minorities are almost always over-represented in all kinds of national events. Being a minority in China means you can get tons of advantages (lower scores required to enter good colleges, financial aid, etc.).

The language we call Mandarin has been the native language of some parts of northern China for thousands of years, and was certainly not “created” any time recently. Some people speaking dialects of that language migrated to other parts of China. But there are various other languages natively spoken elsewhere in the country.

Singapore is a cosmopolitan port city; there are several Chinese languages spoken there, and Mandarin was not the dominant one until recently. There are also many other languages spoken in Singapore, and from what I understand English is the primary language used for official business. Taiwan was not natively Mandarin speaking but speaks it now because it was taken over (from the Japanese) by the fleeing Mandarin-speaking KMT after they were beaten militarily by the Communists during the Chinese Civil War. Both Singapore and Taiwan were ruled for decades by authoritarian governments. I’m not sure about Singapore but in Taiwan other Chinese languages were forcefully suppressed.

Plenty of other parts of the world manage to communicate across regional/national borders without restricting people’s ability to produce/distribute local media in their native languages.

There are many countries where students learn several languages in school (including their native regional language and a national language) from an early age.

(Disclaimer again: I’m not an expert in the history, politics, or comparative linguistics of China. I recommend Wikipedia as a better first summary, if you are curious to learn about these subjects.)

If you can read Chinese, the Chinese version of the Wikipedia page on Mandarin Chinese has a lot more detail on its origins: https://zh.wikipedia.org/wiki/普通话

If you cannot, I found an English article for you: http://www.alittledynasty.com/history-of-mandarin-chinese.ht...

To summarize, Mandarin was certainly not created out of nothing, but the concept of "Mandarin Chinese" (or rather, Standard Chinese) started with an effort by the newly established Republic of China in 1913 to develop a standard phonetic system for use as the national language of China. The standard was published in the 1920s and is essentially a modified version of the phonetic system used in Beijing. The dialect now spoken in Beijing is very close to Mandarin, but not exactly the same.

I grew up in China and lived in Singapore for a long time. I can tell you for sure that the different dialects spoken by Chinese people should not be confused with completely different languages. First of all, they share the same writing system, and the words and syntax we use in the various dialects are mostly the same. (Some dialects use a few words differently from others, but that's not surprising at all, considering UK English and US English are not exactly the same.)

I speak a southern dialect myself which sounds very different from Mandarin. But there is a somewhat systematic mapping from the dialect to Mandarin, so it was really not much of an effort to learn Mandarin.

I can imagine there must have been some effort to promote the standard in the very beginning, and maybe even "forcefully suppressing" other dialects was needed at some point, but considering the huge benefit, it is undoubtedly the best invention in the history of the Chinese language.

> The language we call Mandarin has been the native language of some parts of northern China for thousands of years, and was certainly not “created” any time recently.

In the same way that Hindi has been the native language of northern India for thousands of years (which is to say that while the Mandarin of today has connections to earlier forms of Chinese, it is hardly a monolithic, unchanging remnant of thousands of years ago).

> I’m not sure about Singapore but in Taiwan other Chinese languages were forcefully suppressed.

Singapore was much more successful in suppressing Hokkien than Taiwan was.

https://en.wikipedia.org/wiki/Speak_Mandarin_Campaign

However, it's worth noting that you call the official language of China "Mandarin" for political reasons. The analogy would be if you called French "Bureaucratese" and said "Yes, but Breton and Occitan are not mutually intelligible with Bureaucratese".

The statements "Bureaucratese is not mutually intelligible with Occitan" and "Mandarin is not mutually intelligible with Cantonese" are both true, but we could just say "French is not mutually intelligible with Occitan" and "Chinese is not mutually intelligible with Cantonese".

I could call it Beijingese (or Pekingese) if you prefer. But many people might not know what I was talking about. Mandarin is the common name used in English to refer to this language.

I don’t have any problem if you want to talk about Parisian French (or pick your preferred other name for it), Castilian Spanish, etc.

Parisian French was pushed onto the people within the borders of the French nation-state by force, by a brutal authoritarian monarchy. Quoting Wikipedia,

‘The goals of the Public School System were made especially clear to the French speaking teachers sent to teach students in regions such as Occitania and Brittany; “And remember, Gents: you were given your position in order to kill the Breton language” were instructions given from a French official to teachers in the French department of Finistère (western Brittany).’

The French state continues to repress minority languages inside its borders. See https://en.wikipedia.org/wiki/Language_policy_in_France

The word Mandarin comes from the Sanskrit word "Mantri", for minister. I wonder why this word was chosen.

In fact, it's just called "Chinese" by Chinese speakers. I never heard the word "Mandarin" before I came to the US.

That's how languages form. You don't get a language like French just by hoping for it to emerge or by letting dialects run their own lives and continue to diverge; you get it by taking a bunch of Romance dialects, including quite separate ones (the Langue d'Oc and Langue d'Oil groups), and pushing them together through a common system of mass media (printed for the time, but still), education and cultural acceptance of a "proper dialect" at the top end of society. That's how you get a strong language that helps you to unify a country and reduce internal barriers of communication; and that's what the Chinese are doing.

By "that's how languages form", I assume you mean something like "standardized languages", which then becomes somewhat circular, because it's certainly not how languages form generally (and even standardized languages can form without mass media, and certainly did, even pre-writing).

That's also not how standard French formed. Standardized French is Parisian French, so just the 'dialect' of the politically-important centre. Not too dissimilar to Latin (which was originally narrowly the dialect of Rome) in that.

Whatever the propaganda, China still contains a number of non-mutually intelligible (though related) languages, but of course Mandarin enjoys much prominence and has an enormous number of speakers.

I have never heard that the language was standardized by taking in langue d’oc elements or elements of any other French languages or dialects. You should give some sources to support that. Standard French is based on the French spoken in Paris, which was then extended (or imposed in some cases) across the whole country via the administration and the public schools (forbidding the use of dialects).

While this might be true for spoken language, luckily the Chinese writing system does not represent how the words are pronounced. All of these "dialects" share the same writing system. So it's much easier for someone speaking one of these dialects to learn and use Mandarin.

How does phonological complexity affect these things? I'm not a linguist, but Hindi at least seems way complex, with a much larger phoneme inventory than other languages I've studied.

I don't think Hindi's phonology is more complex than English, but it is all explicit in the character set rather than implicit in the etymology. (The voicing difference between "this" and "thistle" is my favorite example of this, even beyond the fact that we're representing one of the most common consonant sounds in the language with two characters.)

In the World Atlas of Language Structures http://wals.info/chapter/1 , Hindi is considered to have a large consonant inventory. The total of Hindi consonants plus vowels is not hugely different from that of English. But even though it's relatively large, I certainly wouldn't describe it as complex. It's mostly just the same properties repeated in different locations.

Language areas aren't made more complex by additional phonological complexity. The question you've asked seems to be "Does phonological complexity cause more language diversity?" When put this way, there doesn't seem to be any causal mechanism that could do it. For instance, one might say: Well, English has a lot of vowels. People in California might simplify them one way, whereas people in Texas might simplify them in another way so that they can't understand each other well: therefore, you get additional linguistic diversity. But this requires the Texans and the Californians to be isolated from each other, which isn't what people mean when they say "India is a very linguistically diverse country".

If we try it the other way, "Does language diversity cause more phonological complexity?", languages tend towards each other in case of diversity (because a person who speaks both Chinese and English will sometimes adopt features of one language into the other). This can sometimes lead to the propagation of more sounds (for instance, languages pretty much only use clicks if they're in contact with other languages that use clicks). And sometimes it can lead to the elimination of them. I'm not aware of any particular research about this question, but my guess is that on average it would tend towards the average, and if you took English and put it in India it wouldn't be too long before English in India sounded a lot more Indian.