by ggreer 4258 days ago
It's important to separate spoken information density from written information density. Some languages win at one while losing at the other. Your Arabic example was shorter than the equivalent English on paper, but longer when spoken (4 syllables vs 3).

In terms of information density per syllable, Mandarin wins, with English coming in a close second. When speaking, English usually has more syllables per unit time than Mandarin, so English has the highest spoken information density of any language. Japanese is on the opposite end of the spectrum. Despite having the highest syllabic rate, it has the lowest information density.[1]
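To make the reasoning above concrete: spoken information rate is roughly information per syllable multiplied by syllables per second, so a language can lead on one factor and still lose on the product. A minimal sketch, assuming three hypothetical language profiles with made-up numbers (none of these values come from the paper):

```python
# Hypothetical profiles with made-up numbers, purely for illustration.
# "density" = information per syllable, "syllable_rate" = syllables per second.
profiles = {
    "dense_slow":    {"density": 1.0,   "syllable_rate": 5.0},
    "medium_medium": {"density": 0.875, "syllable_rate": 6.0},
    "sparse_fast":   {"density": 0.5,   "syllable_rate": 8.0},
}


def information_rate(density, syllable_rate):
    """Information per second = information per syllable * syllables per second."""
    return density * syllable_rate


rates = {
    name: information_rate(p["density"], p["syllable_rate"])
    for name, p in profiles.items()
}

# Rank the profiles from highest to lowest information rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.2f}")
```

With these invented values, the profile with the densest syllables does not have the highest rate, and the fastest one has the lowest, which is the shape of the Mandarin/English/Japanese argument above.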

For written information density, languages with logographic writing systems win. This is pretty obvious if you've seen a Chinese or Japanese translation of something familiar, such as a Harry Potter book. They're ludicrously thin.

1. See the figures at the end of this paper: http://www.ddl.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegri...

2 comments

This is very cool, man. Thanks for the link. It's so much fun when someone on HN brings up a topic and someone else promptly throws out established research on it, no matter how big or small the topic.

Just as Apple fanbois have "there's an app for that", I love the HN moments of "Oh, I've got a citation for that", especially for topics I would find very difficult to research at a cursory glance!

> When speaking, English usually has more syllables per unit time than Mandarin, so English has the highest spoken information density of any language.

Of the seven languages in the study, that is, using 20 specific short texts that were originally written in English and then translated (well?) into other languages.

They recognized this issue and accounted for it. From the paper:

> Since the texts were not explicitly designed for detailed cross-language comparison, they exhibit a rather large variation in length. For instance, the lengths of the 20 English texts range from 62 to 104 syllables. To deal with this variation, each text was matched with its translation in an eighth language, Vietnamese (VI), different from the seven languages of the corpus. This external point of reference was used to normalize the parameters for each text in each language and consequently to facilitate the interpretation by comparison with a mostly isolating language (see below).

It shouldn't be particularly surprising that English comes out ahead. It has a huge vocabulary, tons of phonemes, and makes many parts of speech optional. It lacks tones, but adding them would probably mean sacrificing some phonemes to stay comprehensible.

That just deals with the variation in length of the texts, not the effect of translation quality or other possible problems with the experiment, like written -> spoken conversion.