by 616c 4263 days ago
That is a fun experiment. I studied linguistics in college, and I do not think anyone ever discussed the textual density of different languages with the "same" content (the latter part would be its own terrifying chestnut; if you have not studied machine translation and semantic evaluation, good luck ever confirming such a statement).

I studied Arabic a lot, and Chinese about a year. I cannot speak to Chinese with only one hazy year under my belt, but I can speak to Arabic.

Because Arabic has lots of syntax realized at the morphological level, you can encode a whole sentence in what we would call one word in English: subject (with declension inherent and gender variable), verb (conjugated for passive/active, past/present/future, standard/subjunctive), and direct object (declension inherent and gender variable).

أضربه (A-dr-b-u; a (I), dr-b (hit), u (him/it)): I hit him (present tense)

And that is a super simple example. I have seen much more complicated sentences in one word, and even better in two or three. So, I hypothesized Arabic is very, very dense. I think Russian and others could be considered similar.

However, even with this level of density (maybe we argue "compression" from a CS perspective), I noticed books and their translations were routinely about the same length in pages. Never identical, mind you, but never something crazy like 50 pages more (I am guessing; it has been a long time since I made such an experiment and would have trouble agreeing with someone on what is significant).

Now, one could hypothesize a shitload about what this means, but in programming languages computation is realized as the same "stuff" (machine code instructions), whereas no parallel exists for mapping human language to computation, as far as I know from my between-minor-and-major courseload in linguistics, specifically computational linguistics. If someone can contradict me, I would LOVE to read about measured cognition and language constructs.

4 comments

It's important to separate spoken information density from written information density. Some languages win at one while losing at the other. Your arabic example was shorter than the equivalent english on paper, but longer when spoken (4 syllables vs 3).
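To make the "on paper" half of that comparison concrete, here is a minimal sketch that counts Unicode code points and UTF-8 bytes for the one-word Arabic example and its English gloss. It assumes the Arabic word is spelled with the five precomposed code points written out below (the escapes are my assumption about the spelling), and `written_size` is a made-up helper name. Code points are only a rough proxy for space on a page.

```python
# Assumption: the example word is these five code points
# (alef with hamza above, dad, ra, ba, ha).
ARABIC = "\u0623\u0636\u0631\u0628\u0647"
ENGLISH = "I hit him"


def written_size(text):
    """Return two rough measures of written length (hypothetical helper)."""
    return {
        "code_points": len(text),                # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),  # storage size in UTF-8
    }


arabic_size = written_size(ARABIC)
english_size = written_size(ENGLISH)
print("arabic :", arabic_size)
print("english:", english_size)
```

Under these assumptions the Arabic word is shorter in code points but slightly longer in UTF-8 bytes, so even "written density" depends on what you decide to count.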

In terms of information density per syllable, mandarin wins, with english coming in a close second. When speaking, english usually has more syllables per unit time than mandarin, so english has the highest spoken information density of any language. Japanese is on the opposite end of the spectrum. Despite having the highest syllabic rate, it has the lowest information density.[1]
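The arithmetic behind that argument is just information per syllable multiplied by syllables per second. A rough sketch, with the loud caveat that the language names "A", "B", "C" and all numbers below are made-up placeholders and NOT figures from the linked paper:

```python
# Hypothetical placeholder values, NOT taken from the cited paper.
# Each entry: (information per syllable, syllables per second).
LANGUAGES = {
    "A": (1.0, 5.0),    # densest per syllable, slowest
    "B": (0.875, 6.0),  # slightly less dense, a bit faster
    "C": (0.5, 8.0),    # least dense, fastest
}


def information_rate(density_per_syllable, syllables_per_second):
    """Spoken information per second = density per syllable * syllabic rate."""
    return density_per_syllable * syllables_per_second


rates = {
    name: information_rate(density, rate)
    for name, (density, rate) in LANGUAGES.items()
}

# Rank languages from highest to lowest information per second.
ranking = sorted(rates, key=rates.get, reverse=True)
for name in ranking:
    print(name, rates[name])
```

With these placeholders, "B" ends up on top even though "A" wins per syllable, and "C" comes last despite being the fastest, which is the shape of the claim being made here.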

For written information density, logographic languages win. This is pretty obvious if you've seen a Chinese or Japanese translation of something familiar, such as a Harry Potter book. They're ludicrously thin.

1. See the figures at the end of this paper: http://www.ddl.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegri...

This is very cool, man. Thanks for the link. It is so much fun when on HN and someone brings up a topic and someone throws out established research for said topic without much delay, no matter how big or small.

Like Apple fanbois have "there's an app for that", I love HN moments "Oh I got a citation for that" and for topics I would find very difficult to research at a cursory glance!

> When speaking, english usually has more syllables per unit time than mandarin, so english has the highest spoken information density of any language.

That is, of the seven languages in the study, using 20 specific short texts that were originally written in English and then translated (well?) into other languages.

They recognized this issue and accounted for it. From the paper:

Since the texts were not explicitly designed for detailed cross-language comparison, they exhibit a rather large variation in length. For instance, the lengths of the 20 English texts range from 62 to 104 syllables. To deal with this variation, each text was matched with its translation in an eighth language, Vietnamese (VI), different from the seven languages of the corpus. This external point of reference was used to normalize the parameters for each text in each language and consequently to facilitate the interpretation by comparison with a mostly isolating language (see below).
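One plausible reading of that normalization, sketched in code: for each text, divide the syllable count of the reference (Vietnamese) translation by the syllable count in the language being measured, then average over texts. This is my assumption about the procedure, not a quote of the paper's formula; `normalized_density` is a hypothetical name and the syllable counts are invented.

```python
from statistics import mean


def normalized_density(lang_syllables, ref_syllables):
    """Average per-text ratio of reference syllables to language syllables.

    Assumes both lists describe the same texts in the same order. A value
    above 1 means the language needs fewer syllables than the reference
    to say the same thing.
    """
    if len(lang_syllables) != len(ref_syllables):
        raise ValueError("need one reference count per text")
    return mean(ref / lang for lang, ref in zip(lang_syllables, ref_syllables))


# Invented syllable counts for two texts, for illustration only.
language_counts = [80, 100]
reference_counts = [60, 100]
density = normalized_density(language_counts, reference_counts)
print(density)
```

Because each ratio is taken per text, a long text and a short text contribute equally, which is how this handles the variation in length.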

It shouldn't be particularly surprising that english comes out ahead. It has a huge vocabulary, tons of phonemes, and makes many parts of speech optional. It lacks tones, but would probably have to sacrifice some phonemes to stay comprehensible.

That just deals with the variation in length of the texts, not the effect of translation quality or other possible problems with the experiment, like written -> spoken conversion.

Russian is actually less dense than even english, but compensates for it with flexibility. The phrase above could be written in a lot of different ways, which would emphasize different parts of the sentence and give it a different tone.

Being written in a lot of different ways is also possible in Arabic, except under the one-word limitation, since obviously Subject-Verb-Object encoding in one word fixes the word order.

If we loosen that req, it gets more interesting. I assume Russian will line up with the following.

In Arabic, the default in formal Standard Arabic (not the dialects, that is another can of Bedouin worms) is Verb Subject Object. You can, however, have VSO, SVO, OSV, OVS, depending on context. I think you decline and conjugate verbs in Arabic as you would in Russian. So you can probably play with written form, emphasizing different parts as you suggest in a similar way.

Am I way off? That is what I gathered from Russian/USSR republic kids I have befriended over the years. Not sure if that scans.

I disagree: Translations are rarely accused of being as good or as comprehensive as the original. The fact that you can tell a story in 300 pages in Arabic and 300 pages in English is irrelevant.

Iverson received a turing award[1] for his work on this subject.

[1]: http://www.jdl.ac.cn/turing/pdf/p444-iverson.pdf

Cool. Will definitely read more about this then.

Those crazy middle-easterners. How can they calculate with such terse number notation? As if 27 is more readable than XXVII! ;-)