by thomashabets2 31 days ago
Author here.

> It barely scratches the surface.

I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.

The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

The (one!) exploitable flaw found by Mythos in OpenBSD was an impressive endorsement of the OpenBSD developers, and yet as the post says, I pointed it at the simplest of their code and found a heap of UB.

Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.

FTA:

> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.


Fair enough!

> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.

I think as compilers got smarter, UB changed somewhat in meaning. Originally the compilers didn't perform such complex analysis, and while invoking UB could break your program, it would still do something reasonable.
Yes, but compilers got smart enough for it to be a problem around 30 years ago, and we are still arguing about what to do.
There is a reason for this: once all those C compiler benchmarks started, vendors moved away from the approach Fran Allen described to an anything-goes race to win SPEC-style benchmarks.

"Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization...The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue.... Seibel: Do you think C is a reasonable language if they had restricted its use to operating-system kernels? Allen: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve. By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities."

-- Fran Allen interview, Excerpted from: Peter Seibel. Coders at Work: Reflections on the Craft of Programming

What should the behavior above be defined to do?
“Implementation defined behaviour”: compiler author chooses, and documents the choice.

A lot of UB should be implementation defined behaviour instead; this would much better match programmers’ intuitions as they reason about their code - you can even see it in the comments of this post: it’s always things like “this hardware supports / doesn’t support unaligned accesses”, it’s never nasal demons.

I told someone at a conference that UB actually means "implementation-defined, no documentation required". He started to refute me and then stopped.
That isn't true. For UB, the compiler is allowed to assume the UB can never happen. For example, if you dereference a pointer and only afterwards check whether it is NULL, the compiler can remove the NULL check, since NULL is clearly impossible at that point (never mind that you might be on a microcontroller where NULL is a valid address).

The fallout of this is quite large! If behaviour is implementation-defined, the compiler has to stick to one consistent behaviour. There is no such requirement for UB: you can get different behaviour by changing unrelated code, by switching between debug and release builds, or just because of what garbage happened to be on the stack.

Since the compiler is allowed to assume the UB doesn't happen it will also sometimes look like the compiler miscompiled your code elsewhere, but what actually happened was some inlining followed by extrapolating "this can never happen".

UB is often surprising: I have seen unaligned loads crash on x86 due to it being UB in C (even though x86 is generally fine with it). But once a newer compiler decided that it was fine to vectorise that code (since it was clearly aligned), the CPU was no longer happy with it.
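
To make the dereference-before-check case concrete, here is a minimal C sketch. The function names are hypothetical, and whether a given compiler actually deletes the check depends on the compiler and optimization level; the sketch only shows the shape of the problem and the usual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hazardous shape: the load happens before the NULL test. Dereferencing
 * NULL is undefined, so a compiler may assume p != NULL and treat the
 * test below as dead code. Only call this with a valid pointer. */
int first_byte_unchecked(const unsigned char *p)
{
    int v = *p;          /* dereference first ... */
    if (p == NULL)       /* ... so this check may be optimized away */
        return -1;
    return v;
}

/* Safer shape: test first, dereference only on the non-NULL path. */
int first_byte_checked(const unsigned char *p)
{
    if (p == NULL)
        return -1;
    return *p;
}

/* Small self-check that exercises only the well-defined paths. */
void demo_null_checks(void)
{
    unsigned char buf[1] = { 42 };

    assert(first_byte_unchecked(buf) == 42);
    assert(first_byte_checked(buf) == 42);
    assert(first_byte_checked(NULL) == -1);
}
```

Note that the demo never passes NULL to the unchecked version, because that call would itself be the UB under discussion.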

I think parent commenter made a joke. UB can be seen as "implementation defines this to reformat your hard drive. No we don't document it".

That is, the compiler de facto defines what happens when you compile UB code.

So you're not wrong, but I think you missed the sarcastic spin of parent.

Except that UB doesn't mean that. UB means "the developer must never write this".
Both are wrong. It means "this standard does not constrain the behaviour of code that does this".

It's entirely legal for implementations to have predictable behaviour, documented or not, for code that is undefined by the standard. In their quest for maxxing benchmark performance they generally choose not to, but there's really nothing in any standard that stops you from making an implementation that prioritises safety.

Print x twice. Not all “side effects” care about order.

Better yet, define an order for parameter evaluation.

There is an easy way to take control: read the volatile variable only once.

  volatile int x = 5;
  ...
  int y=x;
  printf("%d in hex is 0x%x.\n", y, y);
You're missing the point. Volatile forces two loads of a value that may have changed in the middle. So the value of "x" may depend on the time/order of load.
Which is, if I understand correctly, the entire point of volatile. Don't use it if you don't want that behavior.

And in fact, in the example given, if there is something (another thread or whatever) that can change the value of x, then you don't know what either number will be. Well, in that circumstance, without volatile, it may print the same number both times, but you still don't know what the number will be (unless the read gets optimized away entirely).
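
For reference, a self-contained C sketch of the two options discussed here: reading the volatile once, or reading it twice in separate statements so the order of the loads is fixed. The function names are hypothetical, and `x` stands in for whatever register or shared variable is really being read.

```c
#include <assert.h>
#include <stdio.h>

volatile int x = 5;  /* hypothetical stand-in for a register or shared flag */

/* Reads x exactly once, so both conversions see the same value. */
int format_single_read(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int y = x;  /* the only volatile access */
    return snprintf(buf, len, "%d in hex is 0x%x.", y, (unsigned)y);
}

/* Reads x twice on purpose, but in separate statements, so the two
 * loads are sequenced one after the other rather than left unordered
 * inside a single argument list. */
int format_two_ordered_reads(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int first = x;   /* load 1 */
    int second = x;  /* load 2, sequenced after load 1 */
    return snprintf(buf, len, "%d in hex is 0x%x.", first, (unsigned)second);
}
```

If nothing else modifies `x`, both functions produce the same text; the second form only matters when the value can change between the two loads and you want that to be visible.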

If that behavior is the entire point, then I think the bigger point is that the spec should reflect that and not call it undefined.
Why is that missing the point? Loading it twice, possibly with different values, is the intended behavior. It's only undefined because the C spec doesn't specify the order of the loads (unlike most other languages which have a perfectly well-defined order for side effects in a single expression).
What you are describing is implementation-defined behavior. Using that is perfectly safe and reasonable. Undefined means this program is malformed.
Couldn’t you just define that function arguments are evaluated left to right?

Or just throw an error.

Why? Just for this edge case? It could be faster and/or allow smaller code size to allow this to be undefined.

Undefined is also different from "depends on the compiler", because which behavior is chosen can even depend on the circumstances, whatever code appears before and/or after it.

That said, UB in code, such as this example of ordering of reads of volatile parameters being undefined, does not automatically mean that code that uses it is bad. It may very well be that the function being called doesn't misbehave either way.

That’s the point of the whole article. It’s not worth the speed gain to have a language that nobody can safely use because you can’t really prevent UB when you write it.

> It may very well be that the function being called doesn't misbehave either way.

The function being good or bad has nothing to do with the UB. The UB occurs before the function is called.

I meant reading the uninitialized variable
There is no uninitialized variable, I explicitly initialized it to 5.

And yes indeed, C could do what Rust does and define the order of evaluation for function arguments.

If the argument expressions are indeed side-effect-less, the compiler can always make use of the "as-if" rule and legally reorder the computation anyway, for example to alleviate register pressure.

HCF
I have good news about what UB allows
What is that?
A fictitious assembly instruction (and pretty good TV series).

https://en.wikipedia.org/wiki/Halt_and_Catch_Fire_(computing...

Halt and Catch Fire
Compilation error
It’s hard to detect all UB at compile time
It’s harder depending on the language, which is clearly the point.
> Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.

I presume you're referring to this code:

  pid = waitpid(pid, &status, 0);
  if (WIFEXITED(status))
    rval = WEXITSTATUS(status);
  else
    rval = -1;
The only signal handler find installs is for SIGINFO, and it uses the SA_RESTART flag, so EINTR can be ruled out. The pid argument is definitely valid as you can't reach the above if it wasn't, and there's no other way for the child process to be reaped[1], so no ECHILD.

A check should probably be added in case the situation changes in the future, triggering spooky action at a distance, or were that code to be copy+pasted somewhere where the invariants didn't hold. But I think the current code in its current context is, strictly speaking, correct as-is.

[1] OpenBSD lacks the kernel features for such surprises that might theoretically be possible on Linux.
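
A sketch of what the added check might look like, assuming nothing about how `find` would actually structure it. The helper names are hypothetical; the point is only that `status` is read after `waitpid()` has reported success.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the child's exit status, or -1 if
 * waitpid() failed or the child did not exit normally. status is only
 * inspected after waitpid() succeeds, so it is never read uninitialized. */
int wait_exit_status(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) == -1)
        return -1;               /* status was not written; do not read it */
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}

/* Hypothetical usage: fork a child that exits with the given code and
 * report what the parent observes. */
int run_child_exiting_with(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);             /* child: exit immediately */
    return wait_exit_status(pid);
}
```

This version keeps the original fallback of -1 for anything other than a normal exit, so callers see the same values as before in the non-error case.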

Indeed. That's why I didn't deem it worth reporting.

But in my code, I would have fixed it for the reasons you mention. Sprinkle enough of these around, and some low percentage will in the future have their assumptions invalidated.

Couldn’t waitpid return EINTR if the (parent) process were stopped and then continued?

EINTR scares the crap out of me because nobody expects it!

No. You only get EINTR when a signal handler fires and you didn't use the SA_RESTART flag with sigaction. If you don't install any signal handlers, or you use SA_RESTART on all handlers, or you've blocked/masked all signals (or at least the ones with handlers), you won't get EINTR.

When writing library code, it's important to consider EINTR because you can't know about signal dispositions. Though, the common practice of looping on EINTR kind of defeats the purpose.
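
A rough C sketch of both approaches mentioned above: installing a handler with SA_RESTART, and the common retry-on-EINTR wrapper that library code tends to use. The function names are hypothetical, and the retry loop is shown as the usual practice, not as a recommendation.

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

static void on_signal(int signo)
{
    (void)signo;  /* nothing to do; the handler only needs to exist */
}

/* Install a handler with SA_RESTART so that interrupted slow calls such
 * as waitpid() are restarted instead of failing with EINTR. */
int install_restarting_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Library-style wrapper: it cannot know the caller's signal
 * dispositions, so it retries when waitpid() is interrupted. */
pid_t waitpid_retry(pid_t pid, int *status, int options)
{
    pid_t r;

    do {
        r = waitpid(pid, status, options);
    } while (r == -1 && errno == EINTR);
    return r;
}
```

Errors other than EINTR (for example ECHILD) still come back to the caller unchanged, so the wrapper does not hide the failure cases discussed earlier.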

> Or at least, no human since the invention of C in 1972 has.

No human without proper tools maybe, but what about seL4? It goes beyond proving the code is UB-free and actually formally verifies the code works as intended. And the code is written in C. (the proofs of course aren't)

The proof is interesting because it goes beyond just proving the C code is correct. For some platforms, they compile the code with an ordinary compiler, and verify that the machine code does what the C code is supposed to do. (that's because just writing correct C code doesn't help you if you trigger a compiler bug)

This works even if the compiler (in this case, GCC) isn't verified - they verify a specific output of the compiler, not that the compiler always generates machine code correctly.

> The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

What are you talking about? UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen". It was "If you do this, that will happen".

> UB was coined only in the first C standard, in 1989

Pre 1989, when C did not have a standard, was the behavior unspecified or undefined? That is, of course, a trick question. Because in this context the very definitions of the words come from the standard itself.

Before a language gets a specification, is the de facto specification the five words "you know what I mean"?

The very definition of "UB" in C later became "[…] this document imposes no requirements". Is that not the same thing as "there is no specification (yet)"?

It sounds very zen, but "a non existing specification imposes no requirements".

But I don't think it's meaningful to argue the semantic difference before the (in-context) existence of the words "undefined" vs "unspecified".

> Prior to that there was no "If you do this, anything can happen".

Of course there was. You relied on "common sense".

> It was "If you do this, that will happen".

Haha, of course it wasn't. Before a specification there is neither a definition of "this" nor "that".

Unless you mean ye olde "the compiler implementation is the specification". In which case we'll get dragged into "what even is a language" and "what is the sound of one hand clapping?".

Or, alternatively, it's as true then as it is today. If you go by "GCC x.y.z on platform Z kernel Y, (etc…) is the specification" then there is no UB.

More like, "if you do this, what happens depends on your particular combination of hardware, operating system, and compiler. Don't ask us."
No, that would be implementation defined.
The post I was replying to said,

> UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen".

I.e., the context is, before UB existed as a concept, how would these things be categorized. And I was trying to offer the correction that, before UB existed, it wasn't "all behavior is defined" but rather many behaviors depend on your particular local environment. While that may technically be implementation defined, the current standard requires that implementation defined be documented, and UB-like edge cases were most definitely not documented anywhere consistently in the old days!

No, that's actually UB. The important bit here is "compiler defined" -- UB means the compiler is allowed to assume it never happens while compiling.

Consider, for example, an implementation defined function f() -- which can also diverge/crash horribly, etc.

If I write

    if p {
      print("p is true")
    } else {
      g()
    }

    if p {
      f()
    }
Then either we:
- print "p is true" and execute f, or
- call g()

This is true regardless of if f immediately crashes the computer, nasal demons, whatever -- that's implementation defined.

UB means f may never happen.

And that means the compiler may optimize this to just:

    g()
Notice the difference here: the print never happens, and g always happens.

You can see why this is concerning when you write code like

    if dry_run {
      print("would run rm -rf /")
    } else {
      run("rm -rf /")
    }

    if dry_run {
      // oops: some_debug_string is NULL and will segfault!
      print(some_debug_string);
    }
I see what you're going for, but I don't see how your example is UB. If `p` is a pointer, and, after your `if (p)` check, `p` is dereferenced unconditionally, then yes, your check for `p == NULL` could be removed, and the code under the `if` would be removed as well. But the example you've constructed is not UB.
You misunderstood their example, I think.

It doesn't matter what 'p' is in their example. The point is: if 'f' is undefined behavior (rather than just impl-defined), then the optimizer concludes that the "if p { f() }" can never happen... which means that we're allowed to assume that 'if p { ... } else { ... }' (in the first part of the example) will always take the else branch. The compiler will optimize accordingly and just always call g() unconditionally.
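
Here is a C rendering of the dry-run example, as a sketch only. All names are hypothetical, `run_destructive()` merely bumps a counter, and whether a particular compiler really drops the dry-run branch is an assumption that varies by compiler and flags. The hazardous function is included to show the shape; only its `dry_run == false` path is well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static const char *some_debug_string = NULL;  /* the bug: never set */
int destructive_runs = 0;                     /* counts the "dangerous" action */

static void run_destructive(void) { destructive_runs++; }

/* Hazardous shape. When dry_run is true the second block dereferences
 * NULL, which is UB, so an optimizer may assume dry_run is always false
 * and reduce the whole function to run_destructive(). */
void act_hazardous(bool dry_run)
{
    if (dry_run)
        puts("would run rm -rf /");
    else
        run_destructive();

    if (dry_run)
        putchar(some_debug_string[0]);  /* UB when the pointer is NULL */
}

/* Guarded shape: no path dereferences NULL, so there is no UB for the
 * optimizer to reason backwards from. */
void act_guarded(bool dry_run)
{
    if (dry_run)
        puts("would run rm -rf /");
    else
        run_destructive();

    if (dry_run && some_debug_string != NULL)
        puts(some_debug_string);
}

/* Self-check that stays on well-defined paths only. */
void demo_dry_run(void)
{
    int before = destructive_runs;

    act_guarded(true);                        /* must not run the action */
    assert(destructive_runs == before);
    act_guarded(false);
    assert(destructive_runs == before + 1);
    act_hazardous(false);                     /* defined: dry_run is false */
    assert(destructive_runs == before + 2);
}
```

The demo deliberately never calls `act_hazardous(true)`, since that call is exactly the undefined case.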

> if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL

It's fair to blame the programmer for the choice of programming in a language like this, if it was in fact their choice. As you've so eloquently put, choosing those languages is essentially equivalent to choosing UB, so starting a new project with one of them is 100% blameworthy when the UB is inevitably found.

Not all projects are green field. But sure, new modules can be written in other languages. And C is, as cross-language barriers go, fairly easy to interface with.