Hacker News new | ask | show | jobs
by layer8 648 days ago
TFA is misunderstanding. As he cites from the C standard, “Undefined behavior gives the implementor license not to catch certain program errors that are difficult to diagnose.” Since it’s difficult (and even, in the general case of runtime conditions, impossible at compile time) to diagnose, the implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization. Given that competition between compilers is driven by benchmarks, guess which option compiler writers are choosing.

The discussions in comp.lang.c (a Usenet newsgroup, not a mailing list) were educating C programmers that they can’t rely on (b) in portable C, and moreover, can’t make any assumptions about undefined behavior in portable C, because the C specification (the standard) explicitly refrains from imposing any requirements whatsoever on the C implementation in that case.

The additional thing to understand is that compiler writers are not malevolently detecting undefined behavior and then inserting optimizations in that case, but instead that applying optimizations is a process of logical deduction within the compiler, and that it is the lack of assumptions related to undefined behavior being put into the compiler implementation, that is leading to surprising consequences if undefined behavior actually occurs. This is also the reason why undefined behavior can affect code executing prior to the occurrence of the undefined condition, because logical deduction as performed by the compiler is not restricted to the forward direction of control flow (and also because compilers reorder code as a consequence of their analysis).

3 comments

> This is also the reason why undefined behavior can affect code executing prior to the occurrence of the undefined condition, because logical deduction as performed by the compiler is not restricted to the forward direction of control flow (and also because compilers reorder code as a consequence of their analysis).

According to Martin Uecker, of the C standard comittee, that is not true:

> In C, undefined behavior can not time travel. This was never supported by the wording and we clarified this in C23.

https://news.ycombinator.com/item?id=40790203

It is really hard to prevent this in an optimizing compiler. I don’t think it’s realistic. For example, loop invariants can be affected by undefined behavior in the loop body, and that in turn can affect the code that is generated for a loop condition at the start of the loop, whose execution precedes the loop body. This is a general consequence in static code analysis. Even more so with whole-program optimization.
It's also completely necessary to have any sort of reasonable language semantics. The goal is to have programmers be able to write code that does what they intend. With the C23 addition, time travelling UB doesn't exist, so programmers can write code that does what they intend up to the point of invoking UB. Good enough.

Let's say that's too difficult for compiler writers, so we bring back time travelling UB. That implies UB on a future execution path means the entire execution path has no semantics. We now have to ensure there is no UB on any future execution path to meet our goal. There are basically 4 options:

1. Rely on programmers to never write UB. This has not worked out historically.

2. Compilers must detect and/or prevent all UB statically. This is obviously impossible.

3. Runtimes must exhaustively detect and/or prevent all UB. This is both infeasible and expensive.

4. Give up on semantics for essentially all nontrivial programs. This is the situation today, but if we're going to make this the official position why should we even have a standard?

Maybe I don't understand something, but for me it seems pretty easy. What is needed to be done:

1. Make a list of all UB

2. Define the sensible compiler behavior in each case (for example, let MAX_INT+1 to calculate into MIN_INT on x86_64, just because `add` on x86_64 does that)

3. Treat this as a part of a standard, when compiling the code.

This approach allows to have different compiler behavior on different architectures, which are better suited for the architecture. Maybe on some architectures `add` on signed numbers will generate a CPU exception on overflow, so define this as a way to behave and go with it.

The requirement for “sensible” (i.e. repeatable) behavior breaks many simple, critical optimizations like maintaining the referent of a nominally un-aliased pointer in a register.

What if there’s UB & it is aliased? Some other pointer of a different type in scope also references the same value. The “sensible” thing to do when the value is updated through the alias is…?

That works for a lot of behavior but not everything. For example:

  int f(int x) {
    static int y[] = {42, 43};
    return y[x];
  }
What behavior should `f(-1)` or `f(100)` have? What is sensible?
Desugar to pointer arithmetic, try to do an dereference like

    *(y-1)
and more than likely segfault, or return the value at that address if it's somehow valid.
I'm not seeing how this is a change. C99 also said "for which".

C99: "behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements."

Martin Uecker said that something was fixed in the C23 draft, and when asked about it, pointed to the "for which" as pertaining to just that construct and not the entire program.

I'm afraid that this fellow showed himself unreliable in that thread, in matters of interpreting the C standard. In any case, a random forum remark by a committee member, is not the same thing as a committee response to a request for clarification. It has to be backed by citations and precise reasoning, like anyone else's remark.

Suppose we have a sequence of statements S1; S2; S3 where S3 contains the expression i + 1, i being of type int, and nothing in these statements alters the value of i (it is live on entry into S1 and there is no other entry). It is valid for S1 to be translated according to the supposition that i is less than INT_MAX. Because if that is not the case, then S3 invokes undefined behavior, and S3 is unconditionally reachable via S1.

The whole idea that we can have an __notreached expression which does nothing but invoke UB is predicated on time travel (time travel at program analysis time: being able to reason about the program in any direction). Since __notreached invokes UB, the implementation may assume that the control flow does not reach it and behave accordingly. Any statement which serves as a gateway to __notreached itself invokes undefined behavior, and is therefore assumed unreachable and may be deleted. This reasoning propagates backwards, and so the optimizer can simply delete a whole swath of statements.

Backwards reasoning has been essential in the implementation of compilers for decades. Basic algorithms like liveness analysis involve scanning basic blocks of instructions in reverse order! How you know that a variable is dead at a given point (so its register can be reused for another value) is due to having scanned backwards: the next-use information is a peek into the future (what will be the future when that instruction is running).

And, about the quesiton whether undefined behavior can make the whole program undefined, the answer is maybe. If there is no way to execute the program such that an undefined behavior is avoided, then the whole program is undefined. If the situation can be deduced while it is being translated, then the translator can stop with a diagnostic message.

E.g.:

  #include <stdio.h>

  int main() // (void) is now deprecated in draft ISO C
  {
     printf("hello, world\n");
     return 0/0;
  }
This program does not have a visible behavior of printing the hello message. The 0/0 division is undefined, and amounts to a declaration that the printf statement is unreachable. The implementation is free to delete that statement, or to issue a diagnostic and not translate the program.

Uecker is right in that there are limits on this. If a program issues some output (visible effect) and then performs input, from which it obtains a value, and that value is then embroiled in a calculation that causes undefined behavior, that previous visible effect stands. The whole program is not undefined. It's something like: the program's execution becomes undefined at the point where it becomes inevitable that UB shall occur. That could be where the value is prepared that will inevitably cause the erroneous calculation. So, as far back as that point of no return, the implementation could insert a termination, with or without a diagnostic.

Undefined behavior is not a visible behavior; it doesn't have to be ordered with regard to visible behaviors.

> implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization.

No, they have two choices: (a) assume that the undefined behavior doesn't occur, and implement any output code generation whatsoever under that assumption, or (b) define a behavior for it, and implement output code generation based on that assumption which in many cases amounts to a pessimization.

Optimization isn't relevant. Assuming it can't happen and then continuing to generate code as though it can't happen is all that matters. You can't make any assumptions, including that disabling optimization will change the output code.

> the implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization.

No. The implementor has three choices: (1) Ignore the situation altogether; (2) behave according to documentation (with or without a warning); or (3) issue an error and stop compilation.

Consider

    for (int i=0; i>=0; i++);
(1) Doesn't attempt to detect UB, it just ignores UB and generates the straightforward translation

          mov 0, %l0
    loop: cmp %l0, 0
          bge loop
            add %l0, 1, %l0 ! Delay slot
          ! Continue with rest of program
(2) May detect that would result in integer overflow and do something it documents (like a trap instruction, or run the whole loop, or elide the whole loop).

(3) Detects that would result in integer overflow and stops compilation with an error message.

An expressio unius interpretation—or simply following the well-worn principle of construing ambiguity against the drafter—would not permit crazy things with UB that many current compilers do.

What behavior should the following have:

  int f(int x) {
    switch (x) {
    case 0:
      return 31;
    case 1:
      return 28;
    case 2:
      return 30;
    }
  }
This code on its own has no undefined behavior.

In another translation unit, someone calls `f(3)`. What would you have compilers do in that case?

That path through the program has undefined behavior. However, the two translation units are separate and as such normal tooling will not be able to detect any sort of UB without some kind of whole program static analysis or heavy instrumentation which would harm performance.

What I would have it to do is: Return a number that is in the range of the "int" type, but there is no guarantee what number it will be, and it will not necessarily be consistent when called more than once, when the program is executed more than once (unless the operating system has features to enforce consistent behaviour), when the program is compiled for and running on a different computer, etc. I would also have the undefined value to be frozen, like the "freeze" command in LLVM. Normally, the effect would be according to the target instruction set, because it would be compiled in the best way for that target instruction set. Depending on the compiler options, it might also display a warning that not all cases are handled, although this warning would be disabled by default. (However, some instruction sets might allow it to be handled differently; e.g. if you have an instruction set with tagged pointers that can be stored in ordinary registers and memory, then there is the possibility that trying to use the return value causes an error condition.)
I would do what the standard tells me to do, which is to ignore the undefined behavior if I don't detect it.

On most platforms, that would probably result in the return value of 3 (it would still be in AX, EAX, r0, x0, o0/i0, whatever, when execution hits the ret instruction or whatever that ISA/ABI uses to mark the end of the function). But it would be undefined. But that's fine.

[EDIT: I misremembered the x86 calling convention, so my references to AX and EAX are wrong above. Mea culpa.]

What isn't fine is ignoring the end of the function, not emitting a ret instruction, and letting execution fall through to the next label/function, which is what I suspect GCC does.

So let's change it up a bit.

  typedef int (*pfn)(void);
  int g(void);
  int h(void);

  pfn f(double x) {
    switch ((long long)x) {
    case 0:
      return g;
    case 17:
      return h;
    }
  }

If I understand your perspective correctly, `f` should return whatever happens to be in rax if the caller does not pass in a number which truncates to 0 or 17?
More or less, yes.

I quibble with "should return" because I don't think it's accurate to say it "should" do anything in any specific set of circumstances. In fact, I'm saying the opposite: it should generate the generic, semantic code translation of what is actually written in source, and if it happens to "return whatever happens to be in rax" (as is likely on x64), then so be it.

In my view, that's what "ignoring the situation completely with unpredictable results" means.

Why isn't that fine? The compiler ignored the undefined behavior it didn't detect.
No. No honest person can claim that making a decision predicated on the existence of X is the same as "ignoring" X.
This is the most normal case though, isn't it? Suppose a very simple compiler, one that sees a function so it writes out the prologue, it sees the switch so it writes out the jump tables, it sees each return statement so it writes out the code that returns the values, then it sees the function closing brace and writes out a function epilogue. The problem is that the epilogue is wrong because there is no return statement returning a value, the epilogue is only correct if the function has void return type. Depending on ABI, the function returns to a random address.

Most of the time people accuse compilers of finding and exploiting UB and say they wish it would just emit the straight-forward code, as close to writing out assembly matching the input C code expression by expression as possible. Here you have an example where the compiler never checked for UB let alone proved presence of UB in any sense, it trusted the user, it acted like a high-level assembler, yet this compiler is still not ignoring UB for you? What does it take? Adding runtime checks for the UB case is ignoring? Having the compiler find the UB paths to insert safety code is ignoring?

This won't compile with reasonable compiler flags. (-Wall and a reasonable set of -Werror settings).

Now, assume that you didn't compile this with those flags; what actually happens is entirely obvious but platform-dependent. Assume amd64 (and many other architectures) where the return value is in the "accumulator register", assume that int is 32 bits. The return value will be whatever was in eax. The called function doesn't set eax (or maybe does in order to implement some unrelated surrounding code). The caller takes eax without knowledge of where it came from.

In a new language that isn't C, that function shouldn't compile at all (missing return).

In a C compiler, inserting a trap (x86 ud2, for example) might be reasonable.