by layer8 651 days ago
It is really hard to prevent this in an optimizing compiler. I don’t think it’s realistic. For example, loop invariants can be affected by undefined behavior in the loop body, and that in turn can affect the code that is generated for a loop condition at the start of the loop, whose execution precedes the loop body. This is a general consequence of static code analysis, and even more so of whole-program optimization.

It's also completely necessary to have any sort of reasonable language semantics. The goal is to have programmers be able to write code that does what they intend. With the C23 addition, time travelling UB doesn't exist, so programmers can write code that does what they intend up to the point of invoking UB. Good enough.
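To make the difference concrete, here is a minimal sketch. The function and the counter are hypothetical names invented for illustration. If UB is allowed to time travel, a compiler may reason that `b == 0` leads to UB, assume it never happens, and delete the earlier increment along with the branch. Under the "up to the point of invoking UB" reading, the increment would have to survive.

```c
/* Hypothetical counter, introduced only for this illustration. */
int zero_divisor_warnings = 0;

int log_then_divide(int a, int b) {
    if (b == 0) {
        /* Side effect that happens BEFORE the undefined operation. */
        zero_divisor_warnings++;
    }
    /* UB when b == 0. A compiler that lets UB "time travel" may assume
       b != 0 here and remove the branch above entirely. */
    return a / b;
}
```

With a nonzero divisor nothing interesting happens: `log_then_divide(10, 2)` returns 5 and the counter stays at 0. The disagreement is only about what the `b == 0` call is allowed to do before it reaches the division.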

Let's say that's too difficult for compiler writers, so we bring back time travelling UB. That implies UB on a future execution path means the entire execution path has no semantics. We now have to ensure there is no UB on any future execution path to meet our goal. There are basically 4 options:

1. Rely on programmers to never write UB. This has not worked out historically.

2. Compilers must detect and/or prevent all UB statically. This is obviously impossible.

3. Runtimes must exhaustively detect and/or prevent all UB. This is both infeasible and expensive.

4. Give up on semantics for essentially all nontrivial programs. This is the situation today, but if we're going to make this the official position why should we even have a standard?

Maybe I don't understand something, but to me it seems pretty easy. What needs to be done:

1. Make a list of all UB

2. Define the sensible compiler behavior in each case (for example, let INT_MAX + 1 evaluate to INT_MIN on x86_64, simply because `add` on x86_64 does that)

3. Treat this as part of the standard when compiling the code.

This approach allows different compiler behavior on different architectures, each chosen to suit that architecture. Maybe on some architectures `add` on signed numbers generates a CPU exception on overflow, so define that as the behavior there and go with it.
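As a rough sketch of what those two policies look like when written out by hand in today's C: `wrapping_add` and `checked_add` are hypothetical helper names, not anything from the standard. The wrapping version assumes a two's-complement `int` with no padding bits, which holds on x86_64. The checked version reports overflow instead of trapping, standing in for the "CPU exception" architecture.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Policy 1 (hypothetical helper): wrap like the x86_64 `add` instruction.
   Assumes two's-complement int with no padding bits. */
int wrapping_add(int a, int b) {
    /* Unsigned arithmetic wraps modulo 2^N and is fully defined. */
    unsigned int sum = (unsigned int)a + (unsigned int)b;
    int result;
    /* Reinterpret the bits rather than converting the value. */
    memcpy(&result, &sum, sizeof result);
    return result;
}

/* Policy 2 (hypothetical helper): detect overflow before it happens.
   Returns false and leaves *out untouched if a + b does not fit in int. */
bool checked_add(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b; /* cannot overflow here */
    return true;
}
```

Under the first policy `wrapping_add(INT_MAX, 1)` yields `INT_MIN`; under the second, `checked_add(INT_MAX, 1, &out)` returns false and the caller decides what to do.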

The requirement for “sensible” (i.e. repeatable) behavior breaks many simple, critical optimizations like maintaining the referent of a nominally un-aliased pointer in a register.

What if there’s UB & it is aliased? Some other pointer of a different type in scope also references the same value. The “sensible” thing to do when the value is updated through the alias is…?
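A small sketch of the register-caching case, with a made-up function name. Because `int` and `float` are incompatible types, the compiler may assume the two pointers never refer to the same object and fold the final load into the constant 1.

```c
/* Hypothetical example: i and f have incompatible types, so the compiler
   is allowed to assume they do not alias. */
int store_both(int *i, float *f) {
    *i = 1;
    *f = 2.0f;  /* assumed not to modify *i */
    return *i;  /* may be compiled as "return 1" without reloading */
}
```

Called with two distinct objects this returns 1 either way. If someone passes the same address through a cast, it is UB, and a mandated "sensible" result would seem to require reloading `*i` after every store through any pointer.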

That works for a lot of behavior but not everything. For example:

  int f(int x) {
    static int y[] = {42, 43};
    return y[x];
  }

What behavior should `f(-1)` or `f(100)` have? What is sensible?

Desugar to pointer arithmetic and try to do a dereference like

    *(y-1)
and more than likely segfault, or return the value at that address if it's somehow valid.
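For comparison, here is one way to give the out-of-range calls a fully defined result instead: an explicit bounds check. This is only a sketch, and `f_checked` with its out-parameter is a hypothetical variant rather than anything proposed above. The cost is a compare and branch on every call, which is exactly what the "just dereference it" approach avoids.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical checked variant of f: returns false for an out-of-range
   index instead of reading outside the array. */
bool f_checked(int x, int *out) {
    static const int y[] = {42, 43};
    const size_t n = sizeof y / sizeof y[0];

    if (x < 0 || (size_t)x >= n) {
        return false; /* f(-1) and f(100) land here */
    }
    *out = y[x];
    return true;
}
```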