The fact that a compiler is allowed/encouraged to silently remove whole sections of code because of some obscure factoid is an amazing source of footguns.
At least the warnings are getting a bit better for some of these.
Without these optimizations, you can't write fast scientific code in C. This was realized back in the early 1980s and it's why those rules were added.
In Fortran the aliasing rules are even stricter: given two arrays passed in as arguments the compiler can assume that they do not overlap, for example. I remember messing that up as a student long ago and getting strange results. The Fortran rule was to enable vectorization, which has been done for many decades.
You doubt it because you aren't a compiler developer, so you aren't aware of the history. Try taking some of the classic Fortran scientific programming libraries, such as LAPACK, recoding them in C, and then see what happens when you need to generate code with a compiler that doesn't do any of the optimizations the article complains about and has no undefined behavior. Then you'll figure out why.
My understanding is the reason FORTRAN is faster than C isn't because of stupid stuff like noalias and the like. It's because FORTRAN has arrays and C doesn't.
You're right, sort of. Because C doesn't have proper arrays, you make up for it by passing a pointer to the first element and the range. So for C to do as well as Fortran on matrix operations the compiler developers need help. One source of help is that if you have a pointer to double and a pointer to int, you can assume that writing pdouble[k] doesn't alter anything reachable as pint[m].
Why would that put C at a disadvantage performance-wise? Array decay is, with hindsight, an unfortunate feature, but optimizing access to contiguous memory is very low-hanging fruit as far as compiler optimizations go.
If you want better safety, runtime and compile time performance the lowest hanging fruit would be to fire the C standards committee and hard fork C++/C into C++ and C only ecosystems.
Cool, make a sci-C with all the optimisations allowed that is trivially interoperable with simple-C. Mangling the latter into the former serves no-one.
How much slower would removing the UB footgun make scientific code? Do you have benchmarks? Because we have endless firsthand accounts that it makes standard C useless.
I don't get how the fact that the compiler can remove or modify the code was thought to be a good idea. I get removing unused functions, but not conditions and changing the flow of the code. If there is unreachable code, best to issue a warning and let the programmer fix it. The compiler should optimize without changing the semantic of the code, even if it contains undefined/unspecified behavior.
To this it's impossible to write C without using a ton of non standard attributes and compiler options to just make it do the correct thing.
> The compiler should optimize without changing the semantic of the code, even if it contains undefined/unspecified behavior.
That is what it does. "Undefined behavior" is a lack of semantics, so it is preserving semantics when it leaves those paths out. You can make a "defined C" with `-fsanitize-trap=undefined`, but C was never a high level assembler, and performance is critical for C users too.
C was always a high-level assembler. UB was meant to be "this cannot be defined portably, refer to your CPU's documentation". What do you gain by making C unusable by default? Just make it 5% slower by default, but possible to reason about and give people the option to shoot themselves in the foot. I don't know what more you need than the creator of the language telling you "this is not possible to implement sanely".
> UB was meant to be "this cannot be defined portably, refer to your CPU's documentation"
You're probably thinking of implementation-defined behaviour, which is the term for behaviour that, although not fully specified by the standard, is a valid behaviour for the program.
Undefined behaviour is different. From the article: "Undefined behavior gives the implementor license not to catch certain program errors that are difficult to diagnose."
Many people don't notice the distinction, and their intuition about what compilers should do with their code matches implementation-defined behaviour; and therefore they complain when, for UB, it does not behave this way.
There is a very strong argument for reclassifying many specific UBs as implementation-defined behaviours, but that's a rather different conversation from "the compiler should always do something sensible when encountering UB".
Even assuming the 5% number were correct (depending on how expansive your definition of UB is, it may not be), asking everyone who doesn't adjust their compiler flags to accept a 5% slowdown for some theoretical benefits is at odds with economic reality.
Even assuming the 5% number were correct (depending on how expansive your optimisations with UB assumptions are, it may not be), asking everyone who doesn't adjust their compiler flags to accept their programs being silently miscompiled for some theoretical benefits is at odds with economic reality.
> "Undefined behavior" is a lack of semantics, so it is preserving semantics when it leaves those paths out.
It's not preserving semantics, it's inferring a valid program from an invalid one, but since the programmer entered an invalid program, unilaterally deciding that the valid program was the intended program seems dubious at best. These should really be errors, or warnings at the very least.
Very often, a function inlined can be determined, at the place where it is expanded, to have substantial sections of dead code, based on knowledge of the values of arguments passed to it in that place. Expanded in a different place, different parts are dead. Warnings about the dead parts would lead you to turning off those warnings, so they are never turned on in the first place.
This gets more complicated when an inlined function calls another inlined function. The whole inner function may be in the middle of dead code, and thus everything it calls, too, and so all be elided. This sort of thing happens all the time. These elisions don't typically make code obviously much faster, cycle-wise, but they do make it smaller, with a smaller cache footprint. Cache footprint has a huge effect on overall performance.
In principle, the compiler could arrange to put the dead code in some other page, with a conditional jump there that never happens, consuming just another branch prediction slot. But branch prediction slots are cache, and the branch, though never taken, nonetheless consumes a slot. A processor ISA could, in principle, have a conditional branch instruction that is assumed by branch prediction always to go one way, and so not need to consume any branch prediction cache. But I don't know of any. RISC-V does not seem to be entertaining plans to specify such a branch.
Several extant chips do allow "hint" prefixes for branches, but I think they are ignored in current chips, according to experience where they were determined generally to be wrong. This is unfortunate, as sometimes the most frequently taken direction is not the one you want to be fastest. E.g., when spinning while watching an atomic flag, you want the looping branch to be assumed not taken, to minimize latency once the flag is clear, even though it most frequently is taken in recent history. (High-frequency trading code often has to resort to trickery to get the desired behavior.)
(There is a famous story about engineers at Digital Equipment Corporation, whence we got PDP-11s and Vaxen, and thus, indirectly, Unix. A comprehensive cycle count showed that a huge proportion of the instructions executed in their OS kernel were just one instruction. They moved heaven and earth to optimize this one instruction, but the kernel with the new instruction was exactly 0% faster. It turned out the instruction was used only in the idle loop that ran while waiting until something useful could be done. This really happened, and the same mistake is made again and again, to this day: e.g., people will sincerely swear that "rotl" and "popcnt" instructions are not important, on about the same basis.)
Intel recently implemented an atomic test-and-loop instruction that just stalls execution until an identified cache line is touched, thus avoiding the branch prediction stall getting out of the loop. I have not heard of anybody using it.
RISC-V, incidentally, suffers under exactly the noted delusion, so all existing chips lack all of the supposedly unimportant instructions. I have seen a claim that "RVA22", its general-purpose computing platform, will have them, but cannot find any evidence for it online.
Absolutely, those optimisations should be opt-in, otherwise it's impossible to reason about the correctness of your code. At work we had to replace some arithmetic by inline assembly, as there was literally no other way of making the compiler generate the correct expression.
I'm very curious to learn more about this as I thought it was easily possible to opt out of the overly aggressive defaults of GCC, Clang, and the like.
It is possible - but not whilst maintaining standards-compliance and there are always/often another one where you least expect it.
At the root is taking a “assume the programmers know what they are doing” language and turning it into a “assume the programmers are drooling morons” language.
I think it's the opposite: compiler writers under pressure to produce fast code assume that the programmers are smart, not morons, and that they understand the rather complex rules.
For example: the compiler is assuming someone isn't ignorant and knows, for example, if they want to write a variable as one type and read it as another, they need to consider two things: they are now in implementation-dependent territory (big endian vs little endian effects, for example), and they need to use unions, not just casts, if the same data can be interpreted as two different types and neither is array of char. Failure to use unions properly triggers aliasing problems.
Amusingly, unions are also defined so that operations that pun types are undefined. Gcc has private extensions that provide a way to pun types in unions, but those are not portable.
Specifically: if you have a union with members a and b, and you assign into member a, then reading from member b afterward is, by the C and C++ Standards, usually undefined. You are presumed to have some way to know that member a is live, and use that. You can then write into b, and later read that back out.
The C99 and C11 standards say that the behavior is unspecified (in standard-ese, "unspecified" and "undefined" are different), meaning that the standard doesn't say what you get. Of course that's true: the result is different on a big-endian and a little-endian machine. Packing the bytes of a double might generate a signalling NaN so you get a trap when you read it, or not, and the standards don't require that floating point representations match the IEEE standard, either.
It was many months ago and I no longer remember the full details, plus it's proprietary code, so I can't tell them to you. But the gist was that after optimisation, some floating point arithmetic was done in different order by different compilers. This particular expression really needed to be computed in exact order and, since nothing else worked, we had to just write it in compiler-specific assembly. IIRC some language extensions are coming to better specify float semantics on a per-function basis that might help one day.
That's a compiler bug. Floating-point math shouldn't be reassociated unless you were using -ffast-math, which is indeed a poorly named option that does bad things, but one many people demand.
In Fortran the aliasing rules are even stricter: given two arrays passed in as arguments the compiler can assume that they do not overlap, for example. I remember messing that up as a student long ago and getting strange results. The Fortran rule was to enable vectorization, which has been done for many decades.