Hacker News new | ask | show | jobs
by Aardwolf 34 days ago
What's the reason that C didn't define the order of this?

The horrible undefined behavior of signed integer overflow at least can be explained by the fact that multiple CPU architectures handling those differently existed (though the fact that C even 'attracts' its ill-defined signed integers when you're using unsigned ones by returning a signed int when left shifting an uint16_t by an uint16_t for example is not as forgivable imho)

But this here is something that could be completely defined at the language level, there's nothing CPU dependent here, they could have simply stated in the language specification that e.g. the order of execution of statements is from left to right (and/or other rules like post increment happens after the full statement is finished for example, my point is not whether the rule I type here is complete enough or not but that the language designers could have made it completely defined).

10 comments

The short answer is because C was designed to give leeway to really dumb compilers on really diverse hardware.

This isn't quite the same case, but it's a good illustration of the effect: on gcc, if you have an expression f(a(), b()), the order that a and b get evaluated is [1] dependent on the architecture and calling-convention of f. If the calling convention wants you to push arguments from right to left, then b is evaluated first; otherwise, a is evaluated first. If you evaluate arguments in the right order, then after calling the function, you can immediately push the argument on the stack; in the wrong order, the result is now a live variable that needs to be carried over another function call, which is a couple more instructions. I don't have a specific example for increment/decrement instructions, but considering extremely register-poor machines and hardware instruction support for increment/decrement addressing modes, it's not hard to imagine that there are similar cases where forcing the compiler to insert the increment at the 'wrong' point is similarly expensive.

Now, with modern compilers using cross-architecture IRs as their main avenue of optimization, the benefit from this kind of flexibility is very limited, especially since the penalties on modern architectures for the 'wrong' order of things can be reduced to nothing with a bit more cleverness. But compiler developers tend to be loath to change observable behavior, and the standards committee unwilling to mandate that compiler developers have to modify their code, so the fact that some compilers have chosen to implement it in different manners means it's going to remain that way essentially forever. If you were making a new language from scratch, you could easily mandate a particular order of evaluation, and I imagine that every new language in the past several decades has in fact done that.

[1] Or at least was 20 years ago, when I was asked to look into this. GCC may have changed since then.

I'd say it's more like C was designed from really dumb compilers on really diverse hardware. The standard, at least the early versions of it, was more to codify what was out there than to declare what was correct. For most things like this in the standard, you can point to two pre-standardization compilers that did it differently.
Kind of both? There were pre-standard compilers, but when they created the standard, they tried to make it so that one could write really dumb compilers and still fulfill the standard.
I suspect it was also the case that if they didn't make it easy for platform vendors to implement compilers then they wouldn't do it.
gcc used to do this back in the day. Parameter expressions left to right on x86, and right to left on Sparc. I spent a week modifying a bunch of source code, removing expressions with side effects from parameter lists, into my own temporary variables, so that they would all evaluate in the same order.
Sethi-Ullman register allocation reorders subexpression evaluation to achieve efficient register allocation: https://dl.acm.org/doi/10.1145/321607.321620

With modern register allocators and larger register sets, code generation impact from following source evaluation is of course lower than it used to be. Some CPUs can even involve stack slots in register renaming: https://www.agner.org/forum/viewtopic.php?t=41

On the other hand, even modern Scheme leaves evaluation order undefined. It's not just a C issue.

Applying the increment or decrement operators over the same variable more than once on the same line should be a compile-time error.

Anyway, yes, this one example has an obvious order it should be applied. But still, something like it shouldn't be allowed.

> Applying the increment or decrement operators over the same variable more than once on the same line should be a compile-time error

That would be nice, but don't forget the more general case of pointers and aliasing:

    int a = 5;
    int *pa = &a;
    printf("%d", (a++ + ++*pa));
The compiler cannot statically catch every possible instance of a statement where a variable is updated more than once.
Well, aliased updates are undefined behavior already.
Not in C, unless at least one of the pointers were marked `restrict`.
Honestly, having increment in expressions rather than a statement feels like more of a footgun than a benefit. Expressions shouldn't mutate things.
I think the history of this is that these operations were common with assembly programmers, so when C came along, these were included in the language to allow these developers to feel they weren't leaving lots of performance behind.

Look at the addressing modes for the PDP-11 in https://en.wikipedia.org/wiki/PDP-11_architecture and you'll see you can write (R0)+ to read the contents of the location pointed to by R0, and then increment R0 afterwards (so a post increment).

Back in the day, compilers were simple and optimisations weren't that common, so folding two statements into one and working out that there were no dependencies would have been tough with single pass compilers.

You could argue that without such instructions, C wouldn't have been embraced quite so enthusiastically for systems programming, and the world would have looked rather different.

Additionally, those indirect memory instructions ended up disappearing because it complicated virtual memory implementations. It was a pain in the ass to describe the multiple places in memory an instruction could be accessing and which actually faulted to a fault handler, not to mention having to roll back all that state on more complex designs.
I worked on a more recent custom AI ISA that had that too. Pretty neat; I'm surprised it's not more common. I guess it doesn't matter so much now that memory is so much slower than ALU ops.
Python recently went the other way and added an assignment expression. I actually wish more languages would go further and add statement expressions instead of having to imitate them with IIFEs.

C just wouldn't be C without things like a[i++]

If the past few weeks of CVEs indicate anything, it's that C being C maybe isn't a good thing...
Those things are for pointer golf and writing your entire logic inside the if statement.

Both are favorite idioms of C developers. And they are ok if done correctly, clearer than the alternative. They are also unnecessary in modern languages, so those shouldn't copy it (yeah, Python specifically).

In any language where the practice of iteration isn't achieved via C-style for-loops, having an operator devoted to increment just doesn't make sense (let alone four operators, for each of pre/post-increment/decrement). This is one of those backwards things that just needs to be chucked in the bin for any language developed post-2010.
When used well it makes for compact readable code. I don't see what it has to do with for loops or operators specifically. For example you can do the same in scheme while iterating by means of tail recursion.
> I don't see what it has to do with for loops or operators specifically.

The reason that these operators pull their weight in C is because iteration over arrays is achieved by manual incrementation (usually via the leading clauses of the for-loop) followed by direct indexing. Languages with a first-class notion of iteration don't directly index in this way, which overwhelmingly eliminates not only the vast majority of array indexing operations in codebases but also the need to manually futz with the inductive loop variable. Case in point, Rust doesn't have `++` in any form, and it doesn't miss it, because Rust has first-class iteration; on the then relatively rare occasion where you want do want to increment, you can do `+=1`, which doesn't have the footguns of `++` due to assignment being a statement rather than expression, while leading to a simpler language due to leveraging the existing `+=` syntax rather than needing a whole new set of operators.

For loops are hardly the only usecase and built in iteration constructs frequently fall short. For example any mildly complex loop that involves pointer juggling can benefit.

> which doesn't have the footguns of `++` due to assignment being a statement rather than expression,

So then I implement the local equivalent of inc( v ) and ... same issue, right? Plus with rust macros is there any technical reason you can't trivially implement ++ for yourself? That's the case for most lisps that I touched on earlier.

I always hate C-style for-loops because even thought I learned C over 40 years ago, I can never remember whether the increment comes before the test or the test comes after the increment. Fortunately, modern IDEs let me continue to be ignorant on those occasions when they’re necessary (usually because I need the index for some reason).
int d = foo ? bar() : baz();

I think if anything people have been leaning more and more into expressions over statements, because when everything is an expression you end up being able to walk the gradient of complexity a bit more nicely than when you end up with a thing that just has to be broken down to a bunch of statements.

Expressions are nice specifically because they don't tend to mutate things. The ternary operator is not at all the same as `a++` because you have the assign the result.
Wenn wouldn't have pearls like while (dst++ = src++);
It's valuable for compilers to be able to choose the instruction scheduling order. Standards authors try not to unnecessarily bind implementors. If post increment happened after the full statement is finished, then the original value has to be maintained until the next sequence point. Maybe the compiler will be smart enough to elide that, maybe not, but it's a lot more difficult to fix those kinds of edge cases than to say sequencing is undefined.
But this is not valuable if doing so results in different numerical results, and I think that will always happen if ++ is executed at different times, there's no point in a compiler optimizing pointless code that can silently give different results elsewhere
The same rule which makes the evaluation order of a++ + ++a unsequenced also applies to (x+y+z+a+b+c) where x,y,z could be any expression (in a sane case on separate variables and without mix of pre/post increments). Breaking questionable code and introducing UB where reordering changes result is just a side effect of this.

Just switching between left to right or right to left wouldn't be that useful but it also permits to interleave the subexpression evaluation. Grouping memory fetches/writes, taking into account how many execution units and registers of different kinds a CPU has can have some performance benefits.

For example if you have something like `++a[0] + ++a[1] + ++a[2] + ++a[3]` instead of evaluating each increment one by one both GCC and Clang will vectorize it loading all 4 values from memory using single simd instruction, incrementing and then writing result back to memory. And if you add fifth one (but not 8) which needs to be handled using regular instruction, that will be done after the first 4. If standard defined that left subexpression of addition is fully evaluated before the right expression that wouldn't be allowed.

> If standard defined that left subexpression of addition is fully evaluated before the right expression that wouldn't be allowed.

I'm no expert, but surely this would still be allowed so long as the compiler can prove that incrementing a[0] has no effect on the value of a[1]?

Your compiler does many optimizations that break numerical reproducibility, especially in floats. I reviewed a PR the other day that wrote X=AB+(CD)+E;

And when I checked 3 different compilers, each of them chose a different way to use FMAs.

Even with integer math, you can get different numerical results via UB (e.g. expressions with signed overflow one way and not another).

Floating point reproducibility and cross platform determinism requires strict adherence to the IEEE standard and disabling of fused instructions.
The point being that there are many other places where reproducibility fails to hold, especially when optimizations are involved. The standard doesn't mandate a way to disable contraction, nor the existence (or absence) of -ffast-math and others. They're simply different, legal compilations within the broad scope allowed under the standard.
It would only make a difference in cases that are currently UB, so there is no program valid under current C that would be pessimized by this change.
It's a language feature that was in K&R, and the rules around sequencing were introduced in C89. There were good reasons to believe it would pessimize code in the following decades. Dennis Richie himself pointed out that Thompson probably added the operators because compilers of the time were able to generate better code that way.
The C standard doesn't define things where two or more historical compilers disagreed and there wasn't an obviously correct way. This is defined behavior (left to right, assignment last) in Java, which is a different language.
Probably because when C was standardised there were already multiple implementations, and this was an area where implementations differed but it wasn't viewed as important enough to bring them in line with one approach.
> What's the reason that C didn't define the order of this?

I didn't open TFA but my first thought was "Is this even defined?".

It kinda make sense that suck fucktardedness could be not defined.

The only other reasonable option is to make such garbage a compile time error. There is no reasonable definition of what code like that should do and if you write it in the real world you need find better job fit. I'd normally say McDonald's is hiring, but they don't want people like that either
> What's the reason that C didn't define the order of this?

The OP article provides experimental details but annoyingly does not give the big picture w.r.t. C language specifications (provided in the links though).

There are three concepts at interplay here which is at the root of the problem; 1) Expressions (evaluates to a single value) 2) Statements (tells computer to perform an action) and 3) Sequence Points (specific moment during execution when all previous side-effects are guaranteed to be complete).

It is the sequence points during the evaluation of expressions which is important to understand here. From https://en.wikipedia.org/wiki/Sequence_point;

In C and C++, a sequence point defines any point in a computer program’s execution at which it is guaranteed that all side effects of previous evaluations will have been performed, and no side effects from subsequent evaluations have yet been performed. They are a core concept for determining the validity of and, if valid, the possible results of expressions...

1) An expression's evaluation can be "sequenced before" the evaluation of another expression. (Equivalently, the other expression's evaluation can be "sequenced after" that of the first.)

2) The expression's evaluation is "indeterminately sequenced", meaning that one is "sequenced before" the other, but which is unspecified.

3) The expression's evaluation is "unsequenced", meaning the operations in each expression may be interleaved.

The "Order of Evaluation" states; (from https://en.cppreference.com/c/language/eval_order)

"Order of evaluation of the operands of any C operator, including the order of evaluation of function arguments in a function-call expression, and the order of evaluation of the subexpressions within any expression is unspecified (except where noted below)."

The "Single Update Rule" states; (from https://www.accellera.org/images/eda/sv-bc/0282.html)

Between consecutive "sequence points" an object's value can be modified only once by an expression. The C language defines the following sequence points:

       Left operand of the logical-AND operator (&&).
       Left operand of the logical-OR operator (||).
       Left operand of the comma operator.
       Function-call operator.
       First operand of the conditional operator.
       The end of a full initialization expression.
       The expression in an expression statement.
       The controlling expression in a selection (if or switch) statement.
       The controlling expression of a while or do statement.
       Each of the three expressions of a for statement.
       The expression in a return statement.
Putting all of the above together in OP's code snippet; The "single update rule" fails for the expression since the variable a is modified multiple times between two consecutive sequence points and hence the result is UB.

For more detailed explanations, see Angelika Langer's Sequence Points and Expression Evaluation in C++ - https://angelikalanger.com/Articles/VSJ/SequencePoints/Seque...

It's defined. And called "operator precedence", both post/pre-increment have a higher precedence than the single "+".

At least according to this: https://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B#Exp...

I think the main confusion here comes from the fact that "a" is just a value, not a pointer, where it matters when the value/address which the pointer points at is accessed (before of after the increment of the pointer's own 'value').

Anyway… my C skills are rusty. Maybe I get it wrong. :) In any case I always would use brackets to avoid any ambiguity in constructs like this.

Nope. Order of evaluation and operator precedence are completely unrelated. They should have been defined to be the same, but instead order of evaluation was left undefined. So if you write ++a + a++, operator precedence means this will be interpreted as (++a) + (a++), not say ++(a + a)++, but it is up to the compiler whether to execute ++a or a++ first, rather than executing them left to right.
Sometimes it helps to test. Which I just did. :-)

Actually the compiler (at least clang) warns about this:

    $ gcc -W -Wall test.c -o test
    test.c:8:7: warning: multiple unsequenced modifications to 'a' [-Wunsequenced]
            a = a++ + ++a;
                 ^    ~~
    1 warning generated.
The undefined behaviour stems from the fact that "a" is modified multiple times between the "sequence points" (so it's irrelevant to the actual problem that this happens with ++, --, pre-, or, post-, or in which order) We only can modify the variable safely once on the right side without entering bizarro world.

A construct like this certainly can be confusing.