Hacker News
by simonask 28 days ago
Many, many programmers come to C (and C++) with a lower-level understanding that actually gets in the way here. They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.

Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.
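
For anyone following along, here is a minimal sketch of the portable way to get the "just load 4 bytes" behavior: copy through `memcpy` instead of dereferencing a cast pointer. The helper names `load_u32` and `store_u32` are made up for illustration. Mainstream compilers typically lower a fixed-size `memcpy` like this to a single load or store on targets that permit unaligned access, but that is an optimization, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes from any address, aligned or not.
   memcpy is defined here, whereas *(const uint32_t *)p can be UB
   (misalignment, or violating the effective-type rules). */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Matching store, so a value can be round-tripped through a byte buffer. */
static void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

Used on a byte buffer at an odd offset, `store_u32(buf + 1, x)` followed by `load_u32(buf + 1)` gives back `x` without ever forming a misaligned `uint32_t *`.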

6 comments

> They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

I'd clarify this with "They understand that all values are just bytes".

> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.

It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".

A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (signed two's complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards committee) who don't want to make UB into IB.

> A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB

There is no such thing as getting rid of "all UB."

What behavior is the implementation supposed to prescribe for a write to an unpredictable garbage address you read from the network? It could overwrite your code. It could overwrite any value anywhere. It could overlap with anything. Prescribing defined behavior for absolutely everything would require defining a precise, unoptimizable 1-to-1 mapping to assembly code and disallowing any multithreading.

> What behavior is the implementation supposed to prescribe for a write to an unpredictable garbage address you read from the network?

"The compiler is not allowed to elide a write to a garbage address".

Wasn't that easy?

No, because that doesn't mean anything. All the other guarantees the compiler has to make about the other parts of your program have potentially been violated by the write, so the compiler can't guarantee any particular behavior.

Hence, "undefined behavior," of the entire program, not just that particular write.

Turning undefined behavior into implementation-defined behavior is rarely a fix, though.

It's a fix that removes the most pointy part of UB.

"Going past the end of the array results in addressing arbitrary values" I can live with. "Going past the end of an array results in anything happening" is a hard sell.

Is that really a meaningful distinction?

Once you are addressing arbitrary values you are firmly in the realm of "anything happening" in practice, but you've now given up optimization opportunities. As has been repeatedly demonstrated over the years, once memory safety breaks it is practically impossible to make any guarantees about program behavior.

Yes, it's a meaningful distinction. No you are not into "anything happening" in practice.

Your compiler emitting a load operation and it failing isn't "anything". The failure being handled by code that the compiler authors can't predict doesn't make it "anything".

And if you lose optimization opportunities because of this, it's because your optimization is broken. By the way, if you lose optimization opportunities because of this, that means the two versions of the code are meaningfully different and you knew it all along.

Compilers elide loads all the time; this is one of the more basic optimizations a compiler can do. We just mostly think those are "good" optimizations.

I mean... You can turn a one-byte out-of-bounds write into code execution.

https://daniel.haxx.se/blog/2016/10/14/a-single-byte-write-o...

And if you get code execution, then you by definition have "anything".

I think it’s a really easy sell, actually: if you go past the end of the array far enough you end up accessing the stack which includes parts of the program like “where does this function return to” or “what is the index used to perform this access” or “there is no page mapped there”. None of these are arbitrary values.

The "anything can happen" means that the compiler can simply silently refuse to emit the code that does the access.

Documenting that the instructions to access will always be eliminated makes it easier to predict what will happen.

Can you unravel this further (for those of us who don’t know compilers)? I’ve always assumed access past the end of an array can’t always be detected in C, so I don’t see how those instructions could be eliminated.

For example, a dynamically linked library that takes in a pointer, and then writes to the 10 ints after it—whether or not this behavior is defined is determined after that library is compiled, right?
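
A sketch of the situation being asked about, with made-up function names. When `fill_ten` is compiled into a library, the compiler sees only a pointer and has nothing to check it against; whether a given call is defined depends entirely on what the caller passes. The second variant is one possible alternative that takes the capacity explicitly, so the limit can be enforced at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical library function: writes 10 ints starting at p.
   Defined only if the caller's object really has room for 10 ints;
   nothing in this translation unit can verify that. */
void fill_ten(int *p)
{
    for (size_t i = 0; i < 10; i++)
        p[i] = (int)i;
}

/* Variant with an explicit capacity: writes at most cap ints
   (never more than 10) and returns how many were written. */
size_t fill_up_to_ten(int *p, size_t cap)
{
    assert(p != NULL || cap == 0);
    size_t n = cap < 10 ? cap : 10;
    for (size_t i = 0; i < n; i++)
        p[i] = (int)i;
    return n;
}
```

Calling `fill_ten` on an `int[4]` is UB that neither the library's compiler nor, in general, the caller's compiler is required to diagnose; `fill_up_to_ten(small, 4)` stops after four writes.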

Yes, but usually you don't want this. You think you do, but you don't: you can't always eliminate these, and often eliminating the extra accesses is not the most efficient thing to do either. Sometimes it's faster to have the loads and not check, sometimes you can check and skip that path, etc.

Are you talking about creating a pointer (more than one item) past an array, or dereferencing that pointer? Both are currently UB.

For the former, I kinda get it. It may need to be there for cases like with segmented address space where p+10 could actually be a value less than p, for the eventually generated assembly. Maybe it should be fine to create such a pointer, but have it be "indeterminate value" or whatever, if you try to compare that pointer to anything? I don't know enough about compiler internals to say one way or the other.

Dereferencing, though, can only be UB. There may not be a "value" behind that address. There may be a motor that's been I/O mapped, or a self destruct button.

I'm not saying that the result of the dereference be known, I'm saying that the instructions to do the dereference be always emitted.

Right now, if a dereference results in UB, the compiler may omit it entirely.

I think I would defer to someone more of a language lawyer than we, but I'm not sure what you're describing can be expressed in the C abstract machine. If a pointer is invalid, not pointing to an object, then I'm not sure it means anything to "read from there".

I know what you mean, but I'm just not sure you're describing something that fits what C "is". We program C to the abstract machine specified in the standard (5.1.2), and the compiler's job is to translate that into something with identical behavior on particular hardware. Piercing the layers down to actual hardware or assembly isn't really done.

Even "volatile" just says (basically) "touching this object has side effects". It implies no double-loading, speculative store, etc, but doesn't say "don't emit assembly instructions to load this unless the program logic path takes the route where the C program does load it".

The standard is not using ancient language when it refers to "objects with static storage duration" instead of "heap" or ".data segment". It is the true class of objects in the abstract machine.

Wouldn't that make a compiler that emitted bounds checks violate the standard, since it would not be emitting the actual memory operations if you deref out of bounds?

> It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined"

I'd agree to a point. I still think it's unreasonable for compiler writers to get all lawyery about precise terminology. After all "implementation defined" could still be subject to the same lawyeriness (we implemented it, ergo we define it).

To me this is an issue of culture. We need to push back against the view that UB means anything can happen, therefore the compiler can do anything.

But it's genuinely useful. In all seriousness, are you sure you aren't perhaps just using the wrong language? At this point UB and leveraging it for optimization are core parts of the most performant C implementations.

That said, I think there are many cases where compilers could make a better effort to link UB they're optimizing against to UB that appears in the code as originally authored and emit a diagnostic or even error out. But at least we've got ubsan and friends so it seems like things are within reason if not optimal.

>are you sure you aren't perhaps just using the wrong language

Well I think there is a tension here. C is the language for microcontrollers and the language for high performance.

In ye olden days both groups' interests were aligned, because speed in C was about working with the machine. Now that UB has been hijacked for speed, on the microcontroller I'm working on, where I know an int will overflow and rely on that, the overflow is UB and so may be optimised out. I then have to think about what the compiler may do.
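
For what it's worth, a minimal sketch of one way to get wrapping behavior without relying on signed overflow: do the arithmetic in unsigned, which is defined to wrap. The name `wrapping_add_i32` is made up. GCC and Clang also accept `-fwrapv`, which defines signed overflow as wrapping for the whole translation unit, but that is a compiler option rather than standard C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(int32_t) == sizeof(uint32_t),
              "sizes must match for the memcpy below");

/* Hypothetical helper: 32-bit signed addition that wraps on overflow. */
static int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    /* Unsigned arithmetic wraps modulo 2^32, so this sum is always defined. */
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    int32_t out;
    /* Copy the bits back instead of casting: converting an out-of-range
       unsigned value to a signed type is implementation-defined before C23.
       int32_t is required to be two's complement, so the bit pattern is
       the wrapped result. */
    memcpy(&out, &sum, sizeof out);
    return out;
}
```

With this, `wrapping_add_i32(INT32_MAX, 1)` yields `INT32_MIN`, and the compiler has no UB to reason from.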

I wouldn't say C is the wrong language. I would say there are wrong compilers though.

> At this point UB and leveraging it for optimization are core parts of the most performant C implementations.

I am skeptical that NULL-pointer checks being removed contribute anything more than a rounding error in performance gains in any non-trivial program.

I got a measurable improvement from eliminating a null pointer check within the last week. Billions of devices have arm little cores, and the extra branch predictor pressure and frontend bandwidth from those instructions can be significant.

A standard way to eliminate those is to invoke undefined behavior if some condition is not met:

    if (a == NULL) {
      __builtin_unreachable();
    }

Which then allows elimination of the null check in later code, possibly after inlining some function.
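
A slightly fuller sketch of that pattern, assuming GCC or Clang (`__builtin_unreachable` is a compiler builtin; C23 adds a standard `unreachable()` macro in `<stddef.h>`). The macro `ASSUME` and both function names are made up for illustration, and whether the check is actually removed depends on the compiler and optimization level.

```c
#include <stddef.h>

/* Hypothetical macro: promise the optimizer that cond holds.
   If cond is ever false at run time, behavior is undefined. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

/* General-purpose accessor that contains a null check. */
static int get_or_default(const int *a, int fallback)
{
    return a != NULL ? *a : fallback;
}

/* Caller that knows a is never NULL on this path. */
int get_known_nonnull(const int *a)
{
    ASSUME(a != NULL);
    /* Once get_or_default is inlined, the compiler may drop its null
       check, because the a == NULL case was declared unreachable. */
    return get_or_default(a, -1);
}
```

The trade-off is the one discussed throughout this thread: if the promise is ever wrong, nothing about the resulting behavior is guaranteed.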
This series was a good explanation for me of why treating UB this way is genuinely useful: https://blog.llvm.org/2011/05/what-every-c-programmer-should...

Being able to assume certain things don't happen is powerful when you're writing optimisations, not doing that would have a real performance cost

> Being able to assume certain things don't happen is powerful when you're writing optimisations, not doing that would have a real performance cost

A few of those are significant performance gains, the majority are not.

Emitting the instruction for a NULL pointer dereference is effectively no more costly than not emitting that instruction.

It's the code removal that's killing me.

What if the compiler is able to use that to determine that a whole code path is dead, and then significantly improve the surrounding function because of that?

Compilers optimise in multiple passes and removing things earlier can expose optimisation opportunities later that can affect other parts of the code too
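
A sketch of the kind of cascade being described, with a made-up function. The dereference comes first, so the compiler is allowed to conclude the pointer is non-null and treat the later check as dead code. Whether a particular compiler actually does so depends on version and flags; GCC and Clang accept `-fno-delete-null-pointer-checks` to turn this inference off.

```c
#include <stddef.h>

/* Illustrative only: the dereference precedes the null check. */
int read_value(const int *p)
{
    int v = *p;        /* UB if p is NULL, so the compiler may assume p != NULL */
    if (p == NULL)     /* ...which makes this branch dead as far as the optimizer is concerned */
        return -1;
    return v;
}
```

For a valid pointer this simply returns the pointed-to value; the contested part is that the `return -1` path can disappear without any diagnostic.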

> What if the compiler is able to use that to determine that a whole code path is dead,

Then it should warn "unreachable code".

> and then significantly improve the surrounding function because of that?

It's not simply the removal that is the problem, it's that the code is silently removed.

Right. But take the first example, the value of uninitialised memory.

It's undefined, so it doesn't have to be zeroed, which increases efficiency.

But it's also UB, so if you do know that memory contains something, you can't take advantage of that. Having it be UB is fine. The problem is compilers assuming UB can't happen and optimising the code away.

> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.

Not if those 4 bytes span a cacheline boundary, that will most likely result in 1/2 throughput compared to loading values inside a single cacheline. And if it causes cache-misses it takes up twice the L2 or L3 bandwidth.

Even worse, if the int spans two pages, it will need two TLB lookups. If it's a hot variable and the only thing you use from those pages, it even uses up an additional TLB entry, that could otherwise be used for better perf elsewhere, etc.

And if you're on embedded (and many C programs are), Cortex-M CPUs either can't handle unaligned accesses (M0, M0+) or take 2-3 times as long (split the load into 2x2 byte or 1x2 + 2x1 byte)
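
A minimal sketch of checking alignment before taking the fast path, assuming the target provides `uintptr_t` (it is optional in the standard but present on mainstream platforms) and that pointer-to-integer conversion preserves the low address bits, which is implementation-defined. The name `is_aligned_u32` is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p is suitably aligned for a uint32_t access. */
static bool is_aligned_u32(const void *p)
{
    return ((uintptr_t)p % _Alignof(uint32_t)) == 0;
}
```

Code that has to run on cores without unaligned access could use a check like this to choose between a direct load and a byte-wise or `memcpy` fallback.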

I don’t think any of that is justification for making unaligned access UB. It’s reason to avoid it or discourage it in certain scenarios, but it’s infinitesimally rare that loading 8 bytes instead of 4 is even measurable, and that includes embedded.

> that all pointers "are" just register-sized integer addresses

And crucially until DR#260 https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm this was a reasonable guess as to what the pointers are. Probably not a wise guess because it's not how your C compiler worked even then, but a reasonable guess if you didn't think too hard about this.

One way I like to think about this is that all C's types are just the machine integers wearing crap Halloween costumes. Groucho glasses for bool, maybe a Lincoln hat for char, float and double can be bright orange make-up and a long tie. But the pointers are different, because unlike the other types those have provenance.

5 == 5, 'Z' == 'Z', true == true, 1.5f == 1.5f, but whether two pointers are equivalent does not depend solely on their bit pattern in C.

I'm not sure that's right. For instance, the Pentium 4 spec explicitly says unaligned int32 loads take longer. And x86/x64 is very gentle in that regard, other archs would whip you. So an unaligned int access is rightfully treated differently. It should be IB.

Just creating the pointer, though, should not be UB, even though it apparently is. It should not even be IB.

Also, it’s been way more than a decade since Pentium 4 was remotely relevant.

> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.

PCs yes, but there are many other things C is compiled to for which this is not true.

C isn't a programming language. It's not even portable assembly. It's a vague suggestion of a program that might or might not be feasible to run on a target computer and the compiler and other diagnostic tools are under no obligation whatsoever to help you find out what, if anything, is wrong with your program. It's user hostile and should be relegated to the bad old days.

Except ARM32. ARM64 doesn't guarantee it to be valid in all cases either.