by perl4ever 2495 days ago
The concept of "undefined" is incoherent. At the same time as people insist the compiler can do anything under the circumstances, everyone accepts that there is some limit to what it is reasonable to expect it to do. It's all just quibbling over where exactly the limits are. But as long as there are limits, the definition of undefined was never valid.

It seems to me that the problem is that trying to define undefined behavior is an inherent contradiction.

Setting aside the question of what exactly "undefined behavior" means, why does a language spec have to include it? If there is behavior that cannot be defined, why not just omit it from the standard?

> Setting aside the question of what exactly "undefined behavior" means, why does a language spec have to include it? If there is behavior that cannot be defined, why not just omit it from the standard?

The original reason was that there were things they didn't want to define. For example, signed integer overflow works differently on different hardware architectures. If they defined one behavior in the standard then compilers for architectures that didn't do it that way would have to do something inefficient to make it work the way the standard says it should rather than the way that hardware actually does it.

Calling it "undefined behavior" lets the compiler do whatever the hardware does even if that means the program produces different results on different architectures. It also means that if some new architecture comes out that does it slightly differently, nobody can be surprised when compilers use the native overflow behavior for that architecture.

The flaw was in giving compilers too much discretion. They were generally expected to implement one of the sane versions of signed integer overflow, and specifically the one corresponding to the relevant hardware architecture, but according to the spec they can literally do whatever they want. So we get this:

https://kristerw.blogspot.com/2016/02/how-undefined-signed-o...

  x + c < x       ->   false

Which means you can't use that to check whether signed overflow occurred even when you know the underlying hardware behavior, because if it did occur you've already invoked UB and the compiler is allowed to do anything, including omitting your check, which it does.
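To make that concrete, here is a minimal sketch of the check being discussed next to one that never performs the overflowing addition. The function names are made up for illustration, and whether a given compiler actually folds the naive version to `false` depends on the compiler and the optimization level.

```cpp
#include <limits>

// Naive check: relies on wraparound. For c > 0 the comparison can only be true
// if x + c overflowed, which is UB, so a compiler may fold it to `false`.
// Shown for contrast only.
bool naive_overflow_check(int x, int c) {
    return x + c < x;
}

// Check that inspects the operands first, so no overflowing addition is ever
// evaluated.
bool add_would_overflow(int x, int c) {
    if (c > 0) return x > std::numeric_limits<int>::max() - c;
    if (c < 0) return x < std::numeric_limits<int>::min() - c;
    return false;
}
```

The second version is more verbose, but it stays inside defined behavior for every pair of inputs, so the compiler has no license to drop it.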

What would help a bit is if compilers are going to do something like this, they emitted a warning something like "comparison is always false because signed integer overflow is undefined."

What would help even more is for the next version of the standard to convert a lot of this undefined behavior into implementation-defined behavior or similar, which still allows for hardware-specific implementations but requires them to be documented and prevents a lot of this unintuitive ex post facto "optimization" that causes more trouble than it's worth.

"Calling it "undefined behavior" lets the compiler do whatever the hardware does even if that means the program produces different results on different architectures"

Isn't this "implementation dependent", rather than "undefined"?

This is too narrow a view of things, IMO.

> The original reason was that there were things they didn't want to define.

For signed integer overflow, maybe. I don't claim to know how this evolved in every last detail, but this is definitely not what UB is currently for - there's specifically "implementation-defined behavior" (actual behavior must be documented by the implementation) or "unspecified behavior" (can be non-deterministic, possibly limited) for what you are describing.

http://eel.is/c++draft/intro.abstract
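A rough sketch of how the three categories differ in practice. The helper names (`note`, `ignore_both`, `call_in_some_order`, `call_trace`) are invented for this example and are not from the standard.

```cpp
#include <climits>
#include <string>

// Implementation-defined: the number of bits in a char is fixed and
// documented by each implementation (at least 8).
constexpr int bits_in_char = CHAR_BIT;

// Unspecified: the two argument expressions may be evaluated in either order.
// Both orders are valid, and the implementation need not document which it picks.
std::string call_trace;
int note(char c) { call_trace.push_back(c); return 0; }
int ignore_both(int, int) { return 0; }
void call_in_some_order() { ignore_both(note('a'), note('b')); }

// Undefined: for comparison, `INT_MAX + 1` evaluated at run time has no
// required outcome at all, so it is deliberately not executed here.
```

A portable program may only rely on `call_trace` ending up as either "ab" or "ba", never on one specific order.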

Undefined behavior is what allows many optimizations to be made in the first place, and it is also necessary so that compilers don't have to solve the halting problem.

> What would help a bit is if compilers are going to do something like this, they emitted a warning something like "comparison is always false because signed integer overflow is undefined."

Yes, in that specific case that would be a useful warning. Linters can do that for you. But compilers make use of this assumption all the time, for example when optimizing for loops. Would you like a warning every time the compiler made your loops faster by relying on this UB? Every time a pointer is dereferenced?
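As an illustration of the loop case, here is a hedged sketch with two hypothetical functions that differ only in the counter type. What a particular compiler does with them varies; the comments describe what the language rules permit.

```cpp
// Signed counter: overflow of `i` would be UB, so the compiler may assume the
// loop runs exactly n + 1 times for n >= 0, and may keep `i` in a wider
// register on a 64-bit target.
long long sum_signed(const int* data, int n) {
    long long total = 0;
    for (int i = 0; i <= n; ++i) total += data[i];
    return total;
}

// Unsigned counter: wraparound of `i` is defined, so if n were UINT_MAX the
// condition `i <= n` would never become false. The compiler cannot simply
// assume a trip count of n + 1.
long long sum_unsigned(const int* data, unsigned n) {
    long long total = 0;
    for (unsigned i = 0; i <= n; ++i) total += data[i];
    return total;
}
```

Both functions return the same result for ordinary inputs; the difference is only in what the optimizer is allowed to assume about the edge case.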

> What would help even more is for the next version of the standard to convert a lot of this undefined behavior into implementation-defined behavior or similar, which still allows for hardware-specific implementations but requires them to be documented and prevents a lot of this unintuitive ex post facto "optimization" that causes more trouble than it's worth.

For a lot of UB that is not even an option. How do you find the correct initialization order for dynamic initialization? You can't, you'd have to solve the halting problem. It's the programmer's job to get this right, not the compiler's. What should messing this up result in, if not UB?
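For what it's worth, the usual way programmers take on that job themselves is the construct-on-first-use idiom. This is a minimal sketch with a made-up `registry`; it is one common workaround, not something the standard mandates.

```cpp
#include <string>
#include <vector>

// Hypothetical registry. A function-local static is initialized the first time
// control passes through its declaration, so any caller, including a dynamic
// initializer in another translation unit, gets a fully constructed object.
std::vector<std::string>& registry() {
    static std::vector<std::string> names;
    return names;
}

// Helper used from a dynamic initializer below.
bool register_name(const std::string& name) {
    registry().push_back(name);
    return true;
}

// Dynamic initialization that depends on the registry. Because registry()
// constructs on first use, this is safe wherever this definition lives.
static const bool registered_demo = register_name("demo");
```

Had `names` been a plain namespace-scope object in a different translation unit, nothing would guarantee it was constructed before `registered_demo` ran.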

And you may not like it, but p0907 (which requires signed integers to use two's complement) proposed making signed integer overflow defined, and that proposal was strongly rejected. You put "optimization" in quotes but that's exactly what this is about - in practice it would make tons of code (in particular loops) significantly slower to eliminate this UB. You're free to doubt WG21 but I won't.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p090...

Note that there are compiler switches in most compilers to make signed overflow defined if this is your main gripe with UB.
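For example, GCC and Clang accept `-fwrapv`, which makes signed overflow wrap. If you would rather not depend on a flag, wrapping addition can also be written portably by going through unsigned arithmetic. A small sketch (the name `wrapping_add` is mine):

```cpp
#include <limits>

// Wrapping signed addition without UB: do the arithmetic in unsigned, where
// wraparound is defined, then convert back. The conversion back to int is
// implementation-defined before C++20 and modular from C++20 on; mainstream
// compilers wrap in both cases.
int wrapping_add(int a, int b) {
    return static_cast<int>(static_cast<unsigned>(a) + static_cast<unsigned>(b));
}
```

This only makes the one operation defined; every other signed addition in the program keeps the usual rules.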

> Would you like a warning every time the compiler made your loops faster by relying on this UB?

Yes, because then I know to convert the loop counter to unsigned, which it ought to be anyway so that there isn't problematic behavior if the signed value actually did overflow when using a compiler or compiler flags that don't perform that optimization.

> Every time a pointer is dereferenced?

Every time a pointer is dereferenced and the compiler uses that fact to cause some other statement to have no effect? I want to see that warning, yes.
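The kind of case I have in mind looks like this. It's a sketch with made-up function names, and whether a given compiler removes the check depends on the compiler and flags.

```cpp
// Dereference first, check second: after `*p` the compiler may assume p is not
// null, so the check below may be removed as dead code.
int value_or_minus_one_buggy(const int* p) {
    int v = *p;
    if (p == nullptr) return -1;  // may be optimized away
    return v;
}

// Check first, dereference second: no UB on any path, so the check must stay.
int value_or_minus_one(const int* p) {
    if (p == nullptr) return -1;
    return *p;
}
```

In the first function the null check is exactly the "other statement" that ends up having no effect, and it is almost certainly a bug in the source.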

> You put "optimization" in quotes but that's exactly what this is about - in practice it would make tons of code (in particular loops) significantly slower to eliminate this UB.

That's an argument for why it shouldn't be two's complement, not for why it has to be fully undefined behavior. If you're going to make signed integers never overflow when used as a loop counter, what's wrong with documenting that and offering a warning in -Wall or -Wextra when it happens?

And it's nothing specifically to do with signed integer overflow. If you're removing code the programmer wrote or making conditional statements unconditional because it can only happen in the presence of UB, that's a huge red flag that there is a bug in that program and the compiler should not be silent about it.

> then I know to convert the loop counter to unsigned, which it ought to be anyway

See below.

> Every time a pointer is dereferenced and the compiler uses that fact to cause some other statement to have no effect?

No, every time any logic in the compiler does any UB-based inference. And that's essentially always. For example, you can't reorder variables unless you assume the abstract machine semantics and memory model. You can't elide repeated loads either. And eliding reloads is a very simple case of some expression having no effect.
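A small sketch of the reload case, using a hypothetical function. The inference here rests on the type-based aliasing rules: writing to an `int` object through a `float*` would be UB, so the compiler may assume the store does not change `*p`.

```cpp
// The store through `q` cannot legally modify the int that `p` points to, so
// the compiler may reuse the first load of `*p` instead of reloading it.
int read_twice(const int* p, float* q) {
    int a = *p;
    *q = 1.0f;
    int b = *p;  // may be replaced by `a`
    return a + b;
}
```

No programmer would want a warning here, yet the elided reload is only correct because of a UB-backed assumption.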

More generally, for every compiler optimization I can tell you a UB source that breaks that optimization (i.e. a UB-backed guarantee of the standard that the compiler has to rely on). So should we not do any compiler optimizations at all? That's freely available to you in every compiler.

> That's an argument for why it shouldn't be two's complement, not for why it has to be fully undefined behavior.

Sorry, I don't understand your point. Why should integers not be two's complement? How would that help?

> If you're going to make signed integers never overflow when used as a loop counter, what's wrong with documenting that and offering a warning in -Wall or -Wextra when it happens?

What do you mean "when it happens"? The compiler can't in general determine at compile-time whether a loop counter will overflow. Are you suggesting all loops with a signed counter should produce a warning because everyone should be using unsigned loops? If you want a slow-but-safe-by-default language then you are simply in the wrong place with C++.

> If you're removing code the programmer wrote...

That's essentially the compiler's whole point. Cut through all the abstraction and generate efficient machine code. Yes, I want that three-iteration loop unrolled, I don't actually intend to perform 3 increments and 4 comparisons (and special overflow handling) in machine code. Yes, I want all those container access functions (full of unspoken range assumptions) or recursive variadic templates (full of unconditionally-false-once-expanded ifs) inlined and not appearing at all in the assembly. Try running a no-optimizations build of any larger piece of C++ software and see what I mean.

> ...or making conditional statements unconditional because it can only happen in the presence of UB that's a huge red flag that there is a bug in that program and the compiler should not be silent about it.

You're thinking of trivial situations where the compiler could reasonably guess that relying on UB causes an unwanted optimization. But you can't build a compiler around only the nice and happy cases - what if that situation occurs 4 levels deep in some template code where 3 other functions were already inlined, and the compiler can see that in that specific case some condition cannot be true without UB? Would you like a warning about every such case?

It's just not the compiler's job to second-guess your code. There are tools (specifically linters) that are built to detect these easy cases you're thinking of and help you find these bugs (but they won't help you with bugs in the hard cases either).

> But as long as there are limits, the definition of undefined was never valid.

There are always limits. Your CPU is (for the most part) deterministic, and no amount of UB will change that (well, the nuclear missiles launched due to UB might...).

> It seems to me that the problem is that trying to define undefined behavior is an inherent contradiction.

Here is the definition of UB according to the C++ standard:

    "This document imposes no requirements on the behavior of programs that contain undefined behavior."  
http://eel.is/c++draft/intro.abstract

Don't try to define or reason about the consequences of UB, that's pointless. Just don't provoke any undefined behavior and you get to live in the clearly defined world of the standard.

> Setting aside the question of what exactly "undefined behavior" means, why does a language spec have to include it? If there is behavior that cannot be defined, why not just omit it from the standard?

"X is UB" means "compiler writers may freely assume that X is not done". If you omit that then compilers would have to verify that X is not done, and there are requirements in the standard which would require the halting problem to be solved in order to verify them in user code. The standard likes to avoid forcing compilers to solve the halting problem.

"This document imposes no requirements on the behavior of programs that contain undefined behavior."

The NY Vehicle and Traffic law imposes no requirements on the behavior of drivers who engage in cannibalism. However, it would be odd to interpret this as meaning that if you commit cannibalism, you are exempt from all rules regarding motor vehicles.

There are clearly two kinds of "undefined" behavior - the kind that is defined as undefined, and the kind that is not. To understand either, you have to understand both.

> The NY Vehicle and Traffic law imposes no requirements on the behavior of drivers who engage in cannibalism.

Are you trying to argue that the standard quote is unclear? That you think it can be read "imposes no additional/special requirements" (because that's the interpretation that your traffic law argument assumes)? Because if you ignore the nonsensical meaning, I would read your traffic law sentence as "imposes no requirements whatsoever".

Regardless of what your stance is regarding possible ambiguity in the way that sentence is worded, both the intent and the practical consequences of that statement are abundantly clear: If your program has UB (per what the C++ standard considers UB), then the C++ standard makes absolutely no guarantees about what will happen when you run it.

> There are clearly two kinds of "undefined" behavior - the kind that is defined as undefined, and the kind that is not. To understand either, you have to understand both.

I don't understand what you are trying to say. There is only one kind of undefined behavior. If you follow the rules of the C++ standard you get to live in a nice and predictable world. If you don't, anything can happen and you're on your own.

There's more than one kind of undefined behavior, and probably more than one way to categorize it.

The distinction I was making is between "what the...standard considers UB" and what the standard doesn't consider, period. For instance, the standard doesn't (I assume) declare anything about the effect of cosmic rays on C++ programs. However, that does not mean that C++ compilers are designed or should be designed not to work unless run on equipment that is completely shielded.

There is a semantic difference that seems important to me, but which continually slips away in these discussions. And it's palpably related, for me, to the issues people have with compiler behavior. It's not totally the standard at issue, I don't think, but the culture that provides its context.

It would be nice if following a language standard meant that you get to live in a nice and predictable world, but isn't this an absurd statement?