by uecker 305 days ago
One has to add that of the 218 UBs in ISO C23, 87 are in the core language. Of those we have already removed 26 and are in the process of removing many others. You can find my latest update here (since then there has also been some progress): https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3529.pdf

A lot of that work is basically fixing documentation bugs, labelled "ghosts" in your text. Places where the ISO document is so bad as a description of C that you would think there's Undefined Behaviour but it's actually just poorly written.

Fixing the document is worthwhile, and certainly a reminder that WG21's equivalent effort needs to make the list before it can even begin that process on its even longer document, but practical C programmers don't read the document and since this UB was a "ghost" they weren't tripped by it. Removing items from the list this way does not translate to the meaningful safety improvement you might imagine.

There's not a whole lot of movement there towards actually fixing the problem. Maybe it will come later?

> practical C programmers don't read the document and since this UB was a "ghost" they weren't tripped by it

I would strongly suspect that C compiler implementers very much do read the document, though. Which, as far as I can see, means "ghosts" could easily become actual UB (and worse, sneaky UB that you wouldn't expect.)

The previous language might cause a C compiler developer to get very confused because it seems as though they can choose something else but what it is isn't specified, but almost invariably eventually they'll realise oh, it's just badly worded and didn't mean "should" there.

It's like one of those tricky self-referential parlor box statements. "The statement on this box is not true"? Thanks I guess. But that's a game, the puzzles are supposed to be like that, whereas the mission of the ISO document was not to confuse people, so it's good that it is being improved.

Most of the "ghosts" are indeed just cleaning up the wording. But compiler writers have historically often used any lack of clarity in the standard as an excuse to justify aggressive optimization. This ranges from an overreaching interpretation of UB itself to wacky concepts such as time travel, wobbly numbers, incorrect implementation of aliasing (e.g. still in clang), and pointer-to-integer round trips.
I'm sure the compiler authors will disagree that they were "using any excuse". From their point of view they were merely making transformations between equivalent programs, and so any mistake is either that these are not in fact equivalent programs because they screwed up - which is certainly sometimes the case - or the standard should not have said they were equivalent but it did.

One huge thing they have on their side is that their implementation is concrete. Whatever it is that, say, GCC does is de facto actually a thing a compiler can do. The standards bodies (and WG21 has been worse by some margin, but they're both guilty) may standardize anything, but concretely the compiler can only implement some things. "Just do X" where X isn't practical works fine on paper but is not implementable. This was the fate of the Consume ordering. Consume/ Release works fine on paper, you "just" need to have whole program analysis to implement it. Well of course that's not practical so it's not implemented.

They sometimes screwed up, sometimes just because of bugs, or because different optimization passes had different assumptions that are inconsistent. This somewhat contradicts your second point. Compilers have some things implemented which may be concrete in some sense (because they are in a compiler), but are still not really a "thing" because they are a mess nobody can formalize using a coherent set of rules.

But then, they also sometimes misread the standard in ways I can't really understand. This often can be seen when the "interpretation" changes over time. Earlier compilers (or even earlier parts of the same compiler) implement the standard as written, some new optimization pass has some creative interpretation.

If I understand correctly, the "ghosts" are vacuously UB. As in, the standard specifies that if X, then UB, but X can in fact never be true according to the standard.
Fixing the actual problems is work-in-progress (as my document also indicates), but naturally it is harder.

But the original article also complains about the number of trivial UB.

And yet, I see P1434R0 seemingly trying to introduce new undefined behavior, around integer-to-pointer conversions, where previously you had reasonably sensible implementation defined behavior (the conversions "are intended to be consistent with the addressing structure of the execution environment").

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p14...

Pointer provenance already existed before, but the standards were contradictory and incomplete. This is an effort to more rigorously nail down the semantics.

i.e., the UB already existed, but it was not explicit: it had to be inferred from the whole text, and the boundaries were fuzzy. Remember that anything not explicitly defined by the standard is implicitly undefined.

Also remember, just because you can legally construct a pointer it doesn't mean it is safe to dereference.

The current standard still says integer-to-pointer conversions are implementation defined (not undefined) and furthermore "intended to be consistent with the addressing structure of the execution environment" (that's a direct quote).

I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior?

And regarding pointer provenance, why should going through pointer-to-integer and integer-to-pointer conversions try to preserve provenance at all, and be undefined behavior in situations where that provenance is ambiguous?

The reason I'm using integer (rather than pointer) arithmetic is precisely so I don't have to be bound by pointer arithmetic rules. What good purpose does it serve for this to be undefined (rather than implementation defined) beyond preventing certain programs from being meaningfully written at all?

I'm genuinely curious.

I fully agree with your analysis, but compiler writers did think they could bend the rules, hence it was necessary to clarify that pointer-to-integer casts do work as intended. This is still not in ISO C23 btw, because some compiler vendors did argue against it. But it is a TS now. If you are affected, please file bugs against your compilers.
Do you fully agree? I finally went and read n3005.pdf. The important item there is that a cast to integer exposes the pointer and now the compiler must be conservative and assume that the pointed object might be changed via non trackable pointers. This seems quite a reasonable compromise to make existing code work without affecting the vast majority of objects whose address is never cast to an integer. But ncruces wants defined semantics for arbitrary forged pointers.
You are right, I wasn't thinking straight. I do not fully agree. Creating arbitrary pointers cannot work. Forging pointers to implementation-defined memory regions would be ok though.
> I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior?

How would you define it? Especially in a way that is consistent with the rest of the language and allows common optimizations (remember that C supports variables, which may or may not be stored in memory)?

Just read whatever is at address 12345 of the linear memory. Doesn't matter what that is. If it's an object, if it was malloc'ed, if it's the "C stack", a "global".

It's the only way to interpret *(uint64_t*)(12345) when the standard says that an integer-to-pointer conversion is "intended to be consistent with the addressing structure of the execution environment".

There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, the standard says the implementation should define this to be consistent with the addressing structure of the execution environment, why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"?

If a newer version of that value is also stored in a register and not yet flushed to memory, should the compiler know to insert that flush for you, or is reading a stale value ok?

For what it’s worth there’s a reason you’re supposed to do this kind of access through memcpy, not by dereferencing made up pointers.
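To make the memcpy suggestion concrete, here is a minimal sketch for the case of pulling a wider integer out of a byte buffer. The names `load_u32` and `store_u32` are made up for illustration. This only covers the type and alignment side of the question; it says nothing about whether the source address is one the program is allowed to touch in the first place.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint32_t from an arbitrary (possibly
   unaligned) position in a byte buffer without casting the pointer. */
uint32_t load_u32(const unsigned char *bytes)
{
    uint32_t value;
    memcpy(&value, bytes, sizeof value);  /* copies the object representation */
    return value;
}

/* Hypothetical counterpart: write a uint32_t into a byte buffer. */
void store_u32(unsigned char *bytes, uint32_t value)
{
    memcpy(bytes, &value, sizeof value);
}
```

Mainstream compilers typically turn a fixed-size memcpy like this into a single load or store, but that is an optimization, not something the standard promises.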

> There exists an instruction to do that load in Wasm, there's a builtin to check that 12345 points to addressable memory, the load is valid at the assembly level, the standard says the implementation should define this to be consistent with the addressing structure of the execution environment, why the heck are we playing games and allowing the compiler to say, "nope, that's not valid, so your entire program is invalid, and we can do whatever we want, no diagnostic required"?

Because the language standard is defined to target a virtual machine as output, not any given implementation. That virtual machine is then implemented on various platforms, but the capabilities of the underlying system aren’t directly accessible - they are only there to implement the C virtual machine. That’s why C can target so many different target machines.

It is important to understand why undefined behaviour has proliferated over the past ~25 years. Compiler developers are (like the rest of us) under pressure to improve metrics like the performance of compiled code. Often enough that's because a CPU vendor is the one paying for the work and has a particular target they need to reach at time of product launch, or there's a new optimization being implemented that has to be justified as showing a benefit on existing code.

The performance of compilers is frequently measured using the SPEC series of CPU benchmarks, and one of the main constraints of the SPEC series of tests is that the source code of the benchmark cannot be changed. It is static.

As a result, compiler authors have to find increasingly convoluted ways to make it possible for various new compiler optimizations to be applied to the legacy code used in SPEC. Take 403.gcc: it's based on gcc version 3.2 which was released on August 14th 2002 -- nearly 23 years ago.

By making certain code patterns undefined behaviour, compiler developers are able to relax the constraints and allow various optimizations to be applied to legacy code in places which would not otherwise be possible. I believe the gcc optimization to eliminate NULL pointer checks when the pointer is dereferenced was motivated by such a scenario.
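For anyone who hasn't seen the NULL check case, a rough sketch of the pattern (function names are hypothetical, and whether a given compiler actually drops the check depends on version and flags):

```c
#include <stddef.h>

/* Hypothetical function: the dereference happens before the check. */
int first_element(const int *p)
{
    int value = *p;   /* UB if p is NULL, so the compiler may assume p != NULL */
    if (p == NULL)    /* ...which lets an optimizer treat this branch as dead */
        return -1;
    return value;
}

/* Checking before the dereference keeps the test meaningful. */
int first_element_checked(const int *p)
{
    if (p == NULL)
        return -1;
    return *p;
}
```

Both versions behave the same for valid pointers; they only differ in what the compiler is entitled to do when `p` is NULL.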

In the real world code tends to get updated when compilers are updated, or when performance optimizations are made, so there is no need for excessive compiler "heroics" to weasel its way into making optimizations apply via undefined behaviour. So long as SPEC is used to measure compiler performance using static and unchanging legacy code, we will continue to see compiler developers committing undefined behaviour madness.

The only way around this is for non-compiler developer folks to force language standards to prevent compilers from using undefined behaviour to do that which normal software developers consider to be utterly insane code transformations.

Language standards have much less power than people think and compiler-vendors are of course present in the standard working groups. Ultimately, the users need to put pressure on the compiler vendors. Please file bugs - even if this often has no effect, it takes away the argument "this is what our users want". Also please support compilers based on how they deal with UB and not on the latest benchmark posted somewhere.
Language standards have plenty of power over compiler vendors, however, very few people that are not involved in writing compilers tend to participate in the standards process. Standards bodies bend to the will of those participating.
Dr. Dobb's used to have articles with those benchmarks; here are a couple of examples:

https://dl.acm.org/doi/10.5555/11616.11617

https://jacobfilipp.com/DrDobbs/articles/DDJ/1991/9108/9108h...

In a compiler, you essentially need the ability to trace all the uses of an address, at least in the easy cases. Converting a pointer to an integer (or vice versa) isn't really a deal-breaker; it's essentially the same thing as passing (or receiving) a pointer to an unknown external function: the pointer escapes, whelp, nothing more we can do in that case for the most part.

But converting an integer to a pointer creates a problem if you allow that pointer to point to anything--it breaks all of the optimizations that assumed they could trace all of the uses of an address. So you need something like provenance to say that certain back-conversions are illegal. The most permissive model is a no-address-taken model (you can't forge a pointer to a variable whose address was never taken). But most compilers opt instead for a data-dependency-based model: essentially, even integer-based arithmetic on addresses isn't allowed to go out of bounds at the point of dereference. Or at least, they claim to--the documentation for both gcc and llvm makes this claim, but both have miscompilation bugs because they don't actually allow this.

The proposal for pointer provenance in C essentially looks at how compilers generally implement things and suggests a model that's closer to their actual implementation: pointer-to-integer exposes the address such that any integer-to-pointer can point to it. Note this is more permissive than the claimed models of compilers today--you're explicitly able to violate out-of-bounds rules here, so long as both objects have had their addresses exposed. There's some resistance to this because adhering to this model also breaks other optimizations (for example, (void*)(uintptr_t)x is not the same as x).
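A small sketch of the round trip being described, as I understand the exposure model (the function name is hypothetical; `uintptr_t` is an optional type, though it is available on the usual platforms):

```c
#include <stdint.h>

/* Round-trip a pointer through uintptr_t and write through the result.
   The cast to uintptr_t is what the proposal calls exposing the address. */
int write_via_integer(int *target, int new_value)
{
    uintptr_t address = (uintptr_t)(void *)target;  /* pointer-to-integer: exposes *target */
    int *recovered = (int *)(void *)address;        /* integer-to-pointer: may refer to an exposed object */
    *recovered = new_value;
    return *target;                                 /* must observe the store above */
}
```

Because the address was exposed, the compiler cannot assume `*target` is unchanged by the store through `recovered`, which is exactly why folding `(void*)(uintptr_t)x` back to `x` stops being a free rewrite.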

As a practical matter, pointer provenance isn't that big of a deal. It's not hard to come up with examples that illustrate behaviors that cause miscompilation or are undefined specifically because of pointer provenance. But I'm not aware of any application code that was actually miscompiled because the compiler implemented its provenance model incorrectly. The issue gets trickier as you move into systems code that exists somewhat outside the C object model, but even then, most of the relevant code can ignore their living outside the object model since resulting miscompiles are prevented by inherent optimization barriers anyways (note that to get a miscompile, you generally have to simultaneously forge the object's address, have the object's address be known to the compiler already, and have the compiler think the object's address wasn't exposed by other means).

Pointer provenance was certainly not here in the 80s. That's a more modern creation seeking to extract better performance from some applications at a cost of making others broken/unimplementable.

It's not something that exists in the hardware. It's also not a good idea, though trying to steer people away from it proved beyond my politics.

Pointer provenance probably dates back to the 70s, although not under that name.

The essential idea of pointer provenance is that it is somehow possible to enumerate all of the uses of a memory location (in a potentially very limited scope). By the time you need to introduce something like "volatile" to indicate to the compiler that there are unknown uses of a variable, you have to concede the point that the compiler needs to be able to track all the known uses within a compiler--and that process, of figuring out known uses, is pointer provenance.

As for optimizations, the primary optimization impacted by pointer provenance is... moving variables from stack memory to registers. It's basically a prerequisite for doing any optimization.

The thing is that traditionally, the pointer provenance model of compilers is generally a hand-wavey "trace dataflow back to the object address's source", which breaks down in that optimizers haven't maintained source-level data dependency for a few decades now. This hasn't been much of a problem in practice, because breaking data dependencies largely requires you to have pointers that have the same address, and you don't really run into a situation where you have two objects at the same address and you're playing around with pointers to their objects in a way that might cause the compiler to break the dependency, at least outside of contrived examples.

My grievance isn't with aliasing or dataflow, it's with a pointer provenance model which makes assumptions which are inconsistent with reality, optimises based on it, then justifies the nonsense that results with UB.

When the hardware behaviour and the pointer provenance model disagree, one should change the model, not change the behavior of the program.

Give me an example of a program that violates pointer provenance (and only pointer provenance) that you think should be allowed under a reasonable programming model.
> It's not something that exists in the hardware

This is sort of on the one hand not a meaningful claim, and then on the other hand not even really true if you squint anyway?

Firstly the hardware does not have pointers. It has addresses, and those really are integers. Rust's addr() method on pointers gets you just an address, for whatever that's worth to you; you could write it to a log maybe if you like?

But the Morello hardware demonstrates CHERI, a capability architecture (implemented here on an Arm prototype) in which a pointer has some associated information that's not the address, a sort of hardware provenance.

I'm not a compiler writer, but I don't know how you would be able to implement any optimization while allowing arbitrary pointer forging and without whole-program analysis.
It's an interesting question.

Say you're working with assembly as your medium, on a von Neumann machine. Writing to parts of the code section is expected behaviour. What can you optimise in such a world? Whatever cannot be observed. Which might mean replacing instructions with sequences of the same length, or it might mean you can't work out anything at all.

C is much more restricted. The "function code" isn't there, forging pointers to the middle of a function is not a thing, nor is writing to one to change the function. Thus the dataflow is much easier, be a little careful with addresses of starts of functions and you're good.

Likewise the stack pointer is hidden - you can't index into the caller's frame - so the compiler is free to choose where to put things. You can't even index into your own frame so any variable whose address is not taken can go into a register with no further thought.
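A rough illustration of the address-taken distinction (all names here are hypothetical, and what a particular compiler does with either function is up to its optimizer):

```c
/* Hypothetical callee that receives an escaped pointer. */
void touch(int *p) { *p += 1; }

/* `total` never has its address taken, so no other code can name it;
   a compiler is free to keep it in a register for the whole loop. */
int sum_to(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Here `counter` escapes through &counter. If `touch` were opaque to the
   compiler, `counter` would have to live in memory across the call and
   be reloaded afterwards. */
int count_after_touch(void)
{
    int counter = 10;
    touch(&counter);
    return counter;
}
```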

That's the point of higher level languages, broadly. You rule out forms of introspection, which allows more stuff to change.

C++ has taken this too far with the object model in my opinion but the committee disagrees.

Why? What specific optimization do you have in mind that prevents me from doing an aligned 16/32/64-byte vector load that covers the address pointed to by a valid char*?
Casting a char pointer to a vector pointer and doing vector loads doesn't violate provenance, although it might violate TBAA.

Regarding provenance, consider this:

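  #include <stdlib.h>

  void bar(void);  /* defined elsewhere; opaque to the compiler */

  int foo(void) {
    int *ptr = malloc(sizeof(int));  /* null check omitted for brevity */
    *ptr = 10;
    bar();
    int result = *ptr;
    free(ptr);
    return result;
  }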
If the compiler can track the lifetime of the dynamically allocated int, it can remove the allocation and convert this function to simply

  int foo() {
    bar();
    return 10;
  }
It can't if arbitrary code (for example inside bar()) can forge pointers to that memory location. The code can seem silly, but you could end up with something similar after inlining.
Can't reply to the sibling comment, for some reason.

If you don't know the extents of the object pointed to by the char*, using an aligned vector load can reach outside the bounds of the object. Keeping provenance makes that undefined behavior.

Using integer arithmetic, and pointer-to-integer/integer-to-pointer conversions would make this implementation defined, and well defined in all of the hardware platforms where an aligned vector load can never possibly fail.

So you can't do some optimizations to functions where this happens? Great. Do it. What else?

As for why you'd want to do this. C makes strings null-terminated, and you can't know their extents without strlen first. So how do you implement strlen? Similarly your example. Seems great until you're the one implementing malloc.

But I'm sure "let's create undefined behavior for a libc implemented in C" is a fine goal.
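For context on the strlen point, here is a sketch of the word-at-a-time idea. Real libc implementations read aligned words that may extend past the end of the string object, which is the case under dispute here. This sketch sidesteps that by ASSUMING `s` points to the start of a buffer whose size is a multiple of 8 bytes and which contains a terminator; `has_zero_byte` and `strlen_padded` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff some byte of w is zero (classic bit trick). */
int has_zero_byte(uint64_t w)
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical strlen variant. ASSUMPTION: `s` is the start of a buffer
   whose size is a multiple of 8 bytes and which contains a terminator,
   so every 8-byte chunk read here stays inside the object. */
size_t strlen_padded(const char *s)
{
    size_t i = 0;
    for (;;) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);  /* in bounds by the assumption above */
        if (has_zero_byte(w))
            break;
        i += sizeof w;
    }
    while (s[i] != '\0')              /* locate the terminator inside the chunk */
        i++;
    return i;
}
```

Drop the padding assumption and the last chunk can reach past the end of the object, which is where the argument about provenance versus implementation-defined behaviour starts.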

It very much is something that exists in hardware. One of the major reasons why people finally discovered the provenance UB lurking in the standard is because of the CHERI architecture.
So it's something that exists in some hardware. Are you claiming that it exists in all hardware, and we only realized that because of CHERI? Or are you claiming that it exists in CHERI hardware, but not in others.

If it only exists in some hardware, how should the standard deal with that?

> If it only exists in some hardware, how should the standard deal with that?

Generally seems to me the C standard makes things like that UB. Signed integer overflow, for example. Implemented as wrapping two's-complement on modern architectures, defined as such in many modern languages, but UB in C due to ongoing support for niche architectures.
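To illustrate the signed overflow case: since `a + b` on `int` is UB when it overflows, portable code has to test before adding. A minimal sketch, with hypothetical function names:

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores a + b in *out when the sum is representable;
   returns false (leaving *out untouched) when it would overflow. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}

/* Unsigned arithmetic, by contrast, is defined to wrap modulo 2^N. */
unsigned wrapping_add(unsigned a, unsigned b)
{
    return a + b;
}
```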

The issues around pointer provenance are inherent to the C abstract machine. It's a much more immediate show-stopper on architectures that don't have a flat address space, and the C abstract machine doesn't assume a flat address space because it supports architectures where that's not true. My understanding is that this used to mean some oddball historical architectures that aren't relevant anymore; nowadays it includes CHERI.

People keep forgetting that SPARC ADI did it first with hardware memory tagging for C.