Hacker News new | ask | show | jobs
by maple3142 24 days ago
Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, the is absolutely no requirements on the behavior of P'.
2 comments

Intuitively yes - the program will be compiled as if B-inputs are never passed to the program, and that can include eliminating code that tries to detect B-inputs.
This is a description of an imaginary compiler, evoked by the ANSI/ISO standards documents, which has never existed and will never exist. To understand what the program will do, you just have to understand the compiler behavior on your target platforms. A helpful intuition pump is: imagine the ANSI/ISO specifications simply do not exist; now what? Well, you just continue your engineering practice, the way you would for any of the myriad languages that never even had a post hoc standards document.
> just

That word is carrying a lot of weight here. Compilers are unbelievably complex these days, and it's impossible for any one human to fully understand the entire compilation process, including the effects of any arbitrary combination of compiler flags.

Any assumptions you have about what the compiler does in the face of UB will collapse on the next patch release of that compiler, or the moment somebody changes the compiler flags, or the moment somebody tries to compile the code for a slightly different OS, not to mention architecture.

There is no other way to understand what C compilers do than reading the standard.

Yet the standard does not tell you what the compilers do.

Linux works on a wide variety of platforms. It also relies on those platforms behaving predictably with respect to what the standard leaves undefined.

This description of ISO UB as a totally insane wonderland of random, malevolent semantics just doesn't describe reality.

Up until the compilers do something to your code that you don’t understand.
yeah then I have to learn how it works and what it assumes and how I can control it and maybe switch to a more well behaved compiler if it's truly insane
Every bug is the result of an imaginary computer that doesn't work exactly like my computer does and triggers a bug in my code. The code works on my machine, so this imaginary computer never existed and will never exist.

Signed vs unsigned chars, and the accompanying extension rules, have already bitten me switching between x86/ARM compilers. Confused the hell out of me when I was just starting out with C.

If you're going to interpret C as in "C on amd64, running on Linux 7.0 on an Arrow Lake Intel processor" then yes, you can get away with a lot of UB. That mitigates the problem but doesn't make it go away.

Not imaginary. Eliding checks on nullptr and integer overflow were both implemented, shipped, miscompiled the linux kernel and grew flags to disable them. I expect there are more if one goes looking.
Well yeah that just means some aspects of the imaginary compiler were in some configurations approximated by some historical compiler versions and were in some cases rejected by the community (which cares about sane semantics even for behavior left undefined by ANSI/ISO) and in some cases left in as defaults but made trivially configurable for anyone who wants to define the undefined behavior.
GCC -O1 and clang -O1 will both optimize this function under the assumption that inputs that cause signed integer overflow are never passed:

    int will_overflow(int a, int b) {
        int sum = a + b;
        if (b > 0 && sum < a)
            return 1;
        return 0;
    }
Right, good example, and both GCC and Clang offer well understood parameters for deciding, per compilation unit, what behavior you want for signed overflow (-fwrapv, -fno-strict-overflow, etc), so in reality it's quite far from spooky arbitrary nasal demons.
Wouldn’t be better to check both inputs before against the max value of that type instead of actually doing the overflow?
There are lots of better ways of doing this, but knowing why this one is bad/wrong requires the mental model described upthread.

(But also, what you describe would be incorrect, since two <MAX values can add to a value that is >MAX, and overflow)

> But also, what you describe would be incorrect, since two <MAX values can add to a value that is >MAX, and overflow

I was maybe unclear. I meant, if you know a sum can introduce overflow (because you have a check right after), why not check the inputs before doing the sum, instead of checking the sum?

Yes, that's a good summary.