by wyc 4255 days ago
If I posted four lines of Chinese or Sanskrit, it's likely that native English speakers would disagree that they had much meaning either. However, this doesn't mean that those lines are inherently devoid of meaning or difficult to parse.
3 comments

That is a fun experiment. I studied linguistics in college, and I do not think anyone ever discussed the textual density of different languages with the "same" content (the latter part would be its own terrifying chestnut; if you have not studied machine translation and semantic evaluation, good luck ever confirming such a statement).

I studied Arabic a lot, and Chinese about a year. I cannot speak to Chinese with only one hazy year under my belt, but I can speak to Arabic.

Because Arabic has lots of syntax realized at the morphological level, you can encode a whole sentence (subject (with declension inherent and gender variable), verb conjugated (to passive/active, past/present/future, standard/subjunctive), and direct object (declension inherent and gender variable)) all in one word, as we know the term in English.

أضربه (A-dr-b-u; a (I), dr-b (hit), u (him/it)): I hit him (present tense)

And that is a super simple example. I have seen much more complicated sentences in one word, and even better in two or three. So, I hypothesized Arabic is very, very dense. I think Russian and others could be considered similar.

However, with this level of density (maybe we argue "compression" from a CS perspective) I noticed books and their translations were routinely about the same length in pages. Never identical mind you, but never something crazy like 50 pages more (I am guessing; it has been a long time since I made such an experiment and would have trouble agreeing with someone on what is significant).

Now, one could hypothesize a shitload about what this means, but in programming languages computation is realized as the same "stuff" (machine code instructions), whereas no parallel exists for mapping human language to computation, as far as I know from my between-minor-and-major courseload in linguistics, specifically computational linguistics. If someone can contradict me, I would LOVE to read about measured cognition and language constructs.

It's important to separate spoken information density from written information density. Some languages win at one while losing at the other. Your arabic example was shorter than the equivalent english on paper, but longer when spoken (4 syllables vs 3).

In terms of information density per syllable, mandarin wins, with english coming in a close second. When speaking, english usually has more syllables per unit time than mandarin, so english has the highest spoken information density of any language. Japanese is on the opposite end of the spectrum. Despite having the highest syllabic rate, it has the lowest information density.[1]

For written information density, logographic languages win. This is pretty obvious if you've seen a Chinese or Japanese translation of something familiar, such as a Harry Potter book. They're ludicrously thin.

1. See the figures at the end of this paper: http://www.ddl.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegri...

This is very cool, man. Thanks for the link. It is so much fun when on HN and someone brings up a topic and someone throws out established research for said topic without much delay, no matter how big or small.

Like Apple fanbois have "there's an app for that", I love the HN moments of "Oh, I got a citation for that", especially for topics I would find very difficult to research at a cursory glance!

> When speaking, english usually has more syllables per unit time than mandarin, so english has the highest spoken information density of any language.

Of the seven languages in the study, using 20 specific short texts that were originally written in English and then translated (well?) into the other languages.

They recognized this issue and accounted for it. From the paper:

Since the texts were not explicitly designed for detailed cross-language comparison, they exhibit a rather large variation in length. For instance, the lengths of the 20 English texts range from 62 to 104 syllables. To deal with this variation, each text was matched with its translation in an eighth language, Vietnamese (VI), different from the seven languages of the corpus. This external point of reference was used to normalize the parameters for each text in each language and consequently to facilitate the interpretation by comparison with a mostly isolating language (see below).

It shouldn't be particularly surprising that english comes out ahead. It has a huge vocabulary, tons of phonemes, and makes many parts of speech optional. It lacks tones, but would probably have to sacrifice some phonemes to stay comprehensible.

That just deals with the variation in length of the texts, not the effect of translation quality or other possible problems with the experiment, like written -> spoken conversion.
Russian is actually less dense than even english, but compensates for it with flexibility. The phrase above could be written in a lot of different ways, which would emphasize different parts of the sentence, and give it a different tone.
Being written in a lot of different ways is also the case with Arabic, except under the one-word limitation, since obviously Subject-Verb-Object encoding in one word fixes the word order.

If we loosen that req, it gets more interesting. I assume Russian will line up with the following.

In Arabic, the default in formal Standard Arabic (not the dialects, that is another can of Bedouin worms) is Verb Subject Object. You can, however, have VSO, SVO, OSV, OVS, depending on context. I think you decline nouns and conjugate verbs in Arabic as you would in Russian. So you can probably play with written form, emphasizing different parts as you suggest in a similar way.

Am I way off? That is what I gathered from Russian/USSR republic kids I have befriended over the years. Not sure if that scans.

I disagree: Translations are rarely accused of being as good or as comprehensive as the original. The fact that you can tell a story in 300 pages in Arabic and 300 pages in English is irrelevant.

Iverson received a Turing Award[1] for his work on this subject.

[1]: http://www.jdl.ac.cn/turing/pdf/p444-iverson.pdf

Cool. Will definitely read more about this then.
Those crazy middle-easterners. How can they calculate with such terse number notation? As if 27 is more readable than XXVII! ;-)
I think that the issue is that just measuring simplicity in terms of number of lines is a bad metric. You can have extremely complex expressions in a single statement that are at least as hard to read and debug as an equivalent, much longer piece of code that employs temporary variables and single-purpose statements.
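To make the contrast concrete, here is a minimal Python sketch. The word-frequency task and both function names are illustrative assumptions, not anything from the program under discussion. It writes the same computation once as a single dense expression and once with temporary variables and single-purpose statements.

```python
from collections import Counter

PUNCTUATION = ".,!?"


def top_words_dense(text, n):
    # One statement: split, strip punctuation, lowercase, count, and rank.
    return [w for w, _ in Counter(t.strip(PUNCTUATION).lower() for t in text.split() if t.strip(PUNCTUATION)).most_common(n)]


def top_words_stepwise(text, n):
    # The same logic, one purpose per statement.
    tokens = text.split()
    stripped = [t.strip(PUNCTUATION) for t in tokens]
    words = [s.lower() for s in stripped if s]
    counts = Counter(words)
    ranked = counts.most_common(n)
    return [word for word, _ in ranked]
```

Both return the same result; which one is easier to read and debug is exactly the point in dispute.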
I disagree fundamentally.

I have noticed every page I scroll causes a comprehension loss of around 90%, so in reading something that is 10 pagefuls long, I might only be able to reproduce a tiny part of the program.

Your mileage may vary.

I find not scrolling, and just moving my eyes, I rapidly absorb the program, and I find most bugs just by reading the code. This practice is absolutely impossible for me if I have to scroll very far and made difficult by scrolling at all.

It is for this reason that I find simply counting the actual words to be an excellent estimate of complexity.

By the way: There are several temporary variables in that code; c:: creates a view called "c" which automatically updates whenever the dependent variables on the right side change.

Yes, the research literature on software development has consistently found that code size is the best measurement of complexity and predictor of error rates. (Sorry I don't have citations handy but we've discussed this many times on HN, and there's a recent study in the book "Making Software" that adds to it.) What's interesting is how strongly this goes against what most people think they know about good programming and clear code.
I just had to troubleshoot a small helper app that took some HTTP input and wrote to a DB. The code was in C# and had about 10 files spread over 3 namespaces, plus a separate test infrastructure project. All sorts of factory models were used to set up an "HTTP pipeline" and authentication modules. The problem I had to fix: after a server upgrade, authentication was broken.

After digging around for a while, I discovered there was no bug. The partner's client code had the auth disabled, and the previous server was misconfigured to not require auth. All of which would not have been a problem if the system just did an 'if headers.auth != "Basic ..."' - but buried in this forest of stuff, it was overlooked.
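For comparison, the kind of direct check described above might look roughly like this. It is a hedged sketch in Python rather than the original C#, and the function name, the dict-shaped headers, and the credential parameters are all assumptions made for illustration.

```python
import base64
import hmac


def is_authorized(headers, username, password):
    """Return True only if the request carries the expected Basic credentials."""
    # Hypothetical helper: `headers` is assumed to be a plain dict of header values.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    expected = "Basic " + token
    supplied = headers.get("Authorization", "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

With the check in one visible place, a missing header fails loudly instead of depending on how the server happens to be configured.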

It seems that some developers just love their edifices. They build all this "infrastructure", expanding code by an order of magnitude or more. It's considered good and robust and so, so much writing online is dedicated to this pursuit. I think it gives those programmers a feeling of import, as if they're really architecting something, not just pushing a few form fields around.

Even on the line by line basis, it's shocking how they love verbosity. Type inference? Nope, that makes things too compact and hard to read. Higher order functions to wrap up common patterns? Too difficult to understand. I'm not sure if developers simply lack the tiny bit of extra intelligence, or if they've tried it and honestly concluded that overflowing verbosity is the key to readability. Either way, it's sad, and holding back progress slightly.
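As an illustration of a higher-order function wrapping up a common pattern, here is a small Python sketch. The retry pattern and the name `retrying` are my own example, not something taken from the app described above.

```python
import functools


def retrying(attempts):
    """Wrap a function so it is retried up to `attempts` times (attempts >= 1)."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Remember the failure and try again.
                    last_error = exc
            # Every attempt failed: re-raise the last error seen.
            raise last_error
        return wrapper
    return decorate
```

The try/except loop is written once, and each call site shrinks to a single `@retrying(3)` line above the function it protects.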

Right, there seems to be a group thinking like this and a group aggressing against it, and vice versa. I recently had a discussion about it, and the 'architecting' bunch (we need 20-layer-deep directories with 1000s of files with < 10 lines / file) keeps shouting about maintainability. The problem is that, after 25 years of professional coding in many different circumstances, I see that most good programmers are much quicker to understand the 'non architected' code (in quotes because good code is not gibberish; it is architected, but not by randomly generating design patterns and applying them), while the not so good programmers say that the 'architected' code is much more maintainable but take weeks or months longer to do anything worthwhile as they are 'grokking the architectural choices'.
If someone produces smaller and faster code than me, then I should want to learn from it. I wonder why other people have the exact opposite reaction.

Why do you think that is?

I think that "It's what I'm used to." is the main reason - intellectual comfort zone.

Having learned BASIC, FORTRAN and Pascal, C seemed like line noise - at first. As did PERL. And then k.

Btw, COBOL seemed "too verbose".

Once I actually started writing many k programs and then reading even more of them, I was able to recalibrate for the abstraction/density. I moved my intellectual comfort zone. Ironically, I was already there with mathematics. However, programming languages were different :).

Now, as a result, every time I have to read Java, I suffer from a kind of fatigue - having to read way too much code to glean the writer's intent. I just want them to get to the F'ing point.

N.B. - Mathematical literature/writing went through this same transition during the Renaissance. Equations were described in natural language (not unlike COBOL). A simple polynomial could require a paragraph of text to describe.

I'm not sure -- I know that after the fourth or fifth time solving a problem on projecteuler.net in 20 lines of code and seeing someone post a 1-line J/K solution, I went and downloaded J. I even managed to solve a few euler problems with it, which I regard as a large accomplishment for a novice. I like to tell people I've written a whole twenty or so lines of code in J!
I do enjoy learning about such things, but, for most of the work I do, performance is nowhere near at the top of the list of things I care about. Also in the past I've been burned by code that's small/fast but is otherwise utterly unmaintainable. I'm not saying that's the case here, but... past experience, and all that tends to color perceptions.

I think with a language like k or q, which appears to be purpose-built for certain types of problems, people look at it and get easily confused and discouraged because it's so different from all the more mainstream general-purpose programming languages they're used to. And it's a lot easier to put down something you don't understand than to admit you don't get it, or to spend lots of time learning something that may not be of much use to you. Kinda sucks, but it's often human nature.

> I think with a language like k or q, which appears to be purpose-built for certain types of problems,

The thing is, it's not purpose built, and it doesn't even appear to be if you suspend your disbelief. The only reason you'd think it is purpose built is because "well, it can't be this short if it wasn't purpose built". But if you go over the manual, and find special built operators, please tell us what they are.

e.g., to compute an average, you can use the function avg:{(+/x)%#x} - with the exception of parentheses, every character has an orthogonal function. Similarly, the maximum subarray sum solution mss:|/0(0|+)\ ; and there are many others. And it's not just math stuff - http://nsl.com has lots of other examples of many kinds -- and most importantly -- is an operating system + GUI not general enough?
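For readers who do not know K, here is my reading of those two definitions spelled out in Python. This is a sketch of what I understand the expressions to compute, not an authoritative translation; the Python names simply mirror the K ones.

```python
from itertools import accumulate


def avg(x):
    # (+/x) sums the list, #x counts it, % divides.
    return sum(x) / len(x)


def mss(x):
    # (0|+)\ keeps a running sum that is clamped at 0, starting from 0;
    # |/ then takes the maximum of those running values.
    running = accumulate(x, lambda acc, item: max(0, acc + item), initial=0)
    return max(running)
```

Under this reading mss is the classic maximum-subarray scan, with the empty subarray (sum 0) allowed, so an all-negative list gives 0.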

One benefit to a short program is that there's not much code to rewrite if you can't read something.

This doesn't happen very often, but I find the thought comforting.

It all depends on how you define "small and fast".

Comments obviously are not code, so it's reasonable to complain about lack of comments.

You suggested wordcount, I think wordcount is good, so it's reasonable to complain about single letter words rather than descriptive words.

uberalex's suggestion for reformatting wouldn't change the algorithm or speed. It would simply spread operations across more lines. That also seems like a reasonable thing to ask, to me. They can learn your method either way.

Edit: I mean, I'm sure fitting more on the screen is valuable, but people already know how to fit many times as much code onto a screen. They avoid it on purpose for whatever reason.

>I mean, I'm sure fitting more on the screen is valuable, but people already know how to fit many times as much code onto a screen. They avoid it on purpose for whatever reason.

I think this reason (whatever it happens to be) is probably wrong.

I don't understand it either. But it happens everywhere, not just in code. Most people are only interested in the "truth" and "facts" as long as it fits within their existing world view.

And things like K rarely do.

Hey, just read bytecode, then. As small and fast as you can get.
But, is number of lines a particularly good size measurement?

Is there evidence one way or the other on whether it's better to measure size with, say, number of lines, number of tokens, or number of nodes in a parse tree? or something else?

My understanding of the literature is that no one has found a better way to measure program complexity than lines of code. In particular, the fancier metrics (cyclomatic complexity and so on) don't add any value over simple LoC.

We've debated the merits of counting tokens before, but I don't recall anyone mentioning a study about it. In real programs—i.e. when you're dealing with idiomatic code as opposed to something designed to game a metric—I doubt that LoC, lexical length, and number of tokens differ much.
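If anyone wants to try the comparison on their own code, a rough Python sketch of the three measures mentioned (lines, tokens, parse-tree nodes) could look like the following. The function name and the choice of what counts as a "line" or a "token" are my assumptions; it only works on Python source, using the standard `tokenize` and `ast` modules.

```python
import ast
import io
import tokenize

# Token types that carry layout or comments rather than program content.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER, tokenize.ENCODING,
}


def size_metrics(source):
    """Return non-blank line, token, and AST node counts for Python source."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    tokens = [
        tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in _LAYOUT
    ]
    nodes = list(ast.walk(ast.parse(source)))
    return {"lines": len(lines), "tokens": len(tokens), "ast_nodes": len(nodes)}
```

Running it over a one-line generator expression and the equivalent explicit loop shows the line count moving a lot more than the token count does, which is the sort of divergence the question is about.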

Scrolling doesn't bother me, but unnecessary code does. So long as I can see the algorithm on the screen that's fine. Love the kOS idea, keep on it!
Actually it is a good metric, certainly to the first order. Yes, you can have a line or three that is more complex than the rest but practically it isn't going to reduce the line count that much.

And token counts don't help as code that insists that each brace must be on its own line detracts from readability. For one thing it pushes the last bit of the function off the bottom of the screen meaning you have to scroll.

A line that is overly complex will eventually get rewritten.

I say this as someone who has written large bodies of code in Sigma 5 assembly, Fortran II and IV, BLISS-36, C, C++, and Lisp. Perhaps more to the point, these days I read large bodies of code measured in millions of lines. Lines of code dictate how long it will take to understand it.

Peter Norvig in PAIP gives some examples of small code and how it can be exceedingly clear.

It is a pity that they don't use Chinese symbols but invented their own. I really hope one day I can write programs in Chinese, i.e., programs are also valid Chinese, and have the same meaning.