by simonask 608 days ago
Awesome writeup. Always interesting to read what Daniel has to say.

I think the fact that it turned out that he was wrong (and UBsan was right, as usual) is a great testament to the shortcomings of C.

Lots of people - both inexperienced and very experienced - celebrate it for being "simple" and "close to the hardware", but the truth of the matter is that it is precisely not close enough to the hardware for people who _know_ what the hardware is doing to be able to do what they expect, and it's too close to the hardware to be able to ignore it.

Lots of experienced C programmers (and - guilt by association - C++ programmers as well) run into UB because they have clear expectations of the compiler. I.e., they know what the compiler should generate, more or less, and C is just a convenient notation. But compilers don't live up to those expectations, because they don't actually compile your code for the hardware. They compile it to the abstract machine defined by the standard, which very often works differently from any real architecture, and then translate that into machine code, even though there is basically a single set of semantics that every single "relevant" (mainstream) architecture implements. This is a holdover from when C had to target architectures that are 100% irrelevant today.

Everybody's favorite example is signed integer overflow. In both x86-64 and ARM64, that just works - two's complement is the only relevant implementation, so there's no issue. But `int` in C and C++ is not that.
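
To make that gap concrete, here is a minimal sketch of getting the two's complement wrap the hardware would do anyway, without ever performing a signed overflow in C. The function name `wrapping_add_i32` is made up for illustration. The addition happens on unsigned values, where wraparound is defined, and the bits are copied back into an `int32_t`, which the standard requires to be two's complement whenever that type exists.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: adds two int32_t values with two's complement
 * wraparound and no signed overflow. Unsigned arithmetic is defined to
 * wrap modulo 2^32, so the addition itself is never UB. */
int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    int32_t result;
    /* Copy the bit pattern back; int32_t has no padding bits and uses
     * two's complement, so this reinterpretation is well defined. */
    memcpy(&result, &sum, sizeof result);
    return result;
}
```

Optimizing compilers will typically turn this into a single add instruction, which is the point: the "obvious" machine behavior is available, you just have to spell it out.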

Almost every single common UB pitfall has reasonable behavior at the assembler level for every mainstream architecture, and almost every single niche architecture.

C gives you the illusion of being close to the hardware, but in actual reality the hardware is several steps removed, so if you want to leverage your knowledge of the hardware, calling conventions, assembly, or other low-level details, you have to go out of your way to work around the C standard.

(Aside: We need new languages to tackle this, and I coincidentally happen to like Rust. Lots of people coming from C or C++ are irritated and frustrated by Rust, but 99% of the time it's because Rust gives you a compile error where C would give you UB. This is one example of that out of thousands.)

7 comments

Many such issues were also fixed in languages like Ada and Modula-2, among others, which C and C++ folks used to call "programming with a straitjacket".

Many folks will complain about anything that breaks their beloved illusion that C is a fancier macro assembler, except even proper macro assemblers have less UB than regular C.

Being close to the hardware isn't a dogma. The point of being close to the hardware is to be able to write code that is (nearly) as performant as it could be if you used assembler. Undefined behavior that lets the compiler reason better about your code, like signed integer overflow, is entirely compatible with that goal. If anything, unsigned overflow should also be UB unless you explicitly tell the compiler that an operation should wrap, which really is not something you want in the common case where any overflow already means you have a bug.
Ada, Modula-2, PL/I, Mesa, NEWP, PL/S, Object Pascal, and many others are just as close with much less UB.
Wouldn't a faster and easier solution for existing code be to come up with a standard set of behaviours based on common architectures and make that a flag that can be switched on so expectations can be met (and call it "expected-C" or for the pun "ocean")?
AFAIK the reason for signed overflow still being UB isn't any exotic hardware property but because it enables some specific optimizations. Whether those optimizations are worth the trouble is up for discussion I guess.
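
A small sketch of the kind of optimization meant here (function names are made up for illustration). Because signed overflow is UB, a compiler is allowed to assume `x + 1` never wraps and may fold the first comparison to a constant. The unsigned version has defined wraparound, so the comparison has to be evaluated for real.

```c
#include <limits.h>

/* Signed: overflow would be UB, so a compiler may assume it cannot happen
 * and is allowed to reduce this to "return 1". Only call it with
 * x < INT_MAX; the INT_MAX case has no defined result. */
int signed_next_is_greater(int x)
{
    return x + 1 > x;
}

/* Unsigned: UINT_MAX + 1u is defined to wrap to 0, so the result really
 * depends on x and cannot be folded to a constant. */
int unsigned_next_is_greater(unsigned x)
{
    return x + 1u > x;
}
```

The same assumption is what lets compilers simplify loop counters and index arithmetic on `int`. Whether a given compiler actually does so depends on version and flags.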
No, it used to be the case that there were architectures where signed integers were represented as 1s complement, so portable code could not rely on signed integer overflow wrapping around (there would either be 2 bit patterns representing zero, or sometimes the all-ones pattern had special meaning, like a trap).

Using this type of UB is a "relatively" new thing (GCC started doing it in the 00s, which broke a lot of stuff in the Linux world, IIRC).

It _is_ true that somebody did the research (can't find the source right now) and found that defining signed integer overflow as wrapping did indeed make some code run slower. I'm skeptical that it matters.

That's why I wrote "still". AFAIK both C++ and C now expect integers to be in 2s complement and made unsigned overflow 'defined', but at the same time kept signed integer overflow as undefined behaviour.
unsigned overflow was always defined
Hmm true, now I wonder what the standard change that integers are expected to be in two's complement format even means in practice when the only important related UB (signed overflow) is still UB. Guess I'll need to read the original proposal again.
C does give a lot of low-level control, but it is still an abstraction - as it should be.

Signed integer overflow is something where I personally like that it is UB, because I can turn on UBSan and find overflow bugs. If it were defined to wrap around, I could not do this.

Does anyone know of languages that achieve this? I'm interested, I'm currently implementing a project in x86 assembly for this reason, and am happy to try a higher level language.
Rust for example has virtually no UB (edit: no UB) in the safe subset, and it’s rare in most applications to have to use unsafe.
Just to nitpick, it's not "virtually" no UB, it's literally no UB.

(Barring compiler bugs, of course.)

And issues in the unsafe parts, which could cause UB in the safe part.
You are right, thanks for the correction.
Sorry, I meant a "close to the hardware" language re parent.
Depending on what you mean exactly I think Rust can be considered pretty close to the hardware. I.e. there's usually an obvious straightforward translation of each line of source code to an implementation in assembly language (which may not be what's actually emitted in practice due to optimizations, but the same is true in C).

There are of course some higher-level features like trait-based generics, so it's not really as close to the machine as C, but it's a lot closer to C IMO than something like Java (or even C++).

Ada, Modula-2 (both available on GCC), D (available on its reference implementation, GCC and LLVM), Swift, Object Pascal (Free Pascal, Delphi, Oxygen), Oberon variants (Oberon, Oberon-2, Oberon-07, Active Oberon, Component Pascal), Zig, Nim, Odin,...
All systems programming languages have undefined behaviour, the golden question is how much.

Also it wasn't what the parent asked for, and I quote: "close to the hardware" language

Rust is not (much?) further from the hardware than C++
C++ is further than Rust in my opinion. Vtables that support inheritance, as well as stack unwinding for exceptions, are pretty complicated and totally implicit in C++. Okay, Rust also technically has unwinding for panics to be fair, but it’s rather unusual in practice for programmers to use panics to mean anything other than “crash the program now”.
Like trait implementations, and trait objects.
I think this is more of an edge case than you give it credit for.

(External linkage) functions, and their callsites, are quite special as they straddle the boundary between actual machine and virtual machine.

The calling convention makes fairly strict requirements of what that must compile down to, since caller and callee could be dynamically linked objects compiled by different compilers or even entirely different languages.

If you define a function that takes a void* parameter, it will read a pointer from RDI(?). If you pass a char*, the caller will pass the pointer in that same register.

Now it must be said that if the compiler can see both the definition and call, then it's free to do whatever it wants e.g. inline or something.

But yeah it's a bit complicated?

Sure, I mean, my objection is with the pretense that a reasonable calling convention in 2024 could decide to represent void* differently from any other T. It makes total sense that C programmers expect pointers to be (transitively) convertible to/from void, so the fact that they aren't convertible in this way without a trampoline means that the standard contains surprises for even very experienced developers.

I postulate that almost all UB in the wild comes from the Standard diverging from (often very reasonable) expectations, and I see that as a big problem with the standard, at least as long as compilers can't reliably detect the problem at compile-time. (And yes, C++ is even more problematic here.)

I think one of the reasons Zig exists is that it contains far fewer surprises. The reason Rust exists is that it does a much better job at preventing and containing such surprises.

Every C compiler I have ever used will tell you that void (*)(void *) is not convertible to void (*)(char *). If people still do it, then they are a bit on their own. But how is this different from Rust's "unsafe"? (Of course, there is other UB that compilers do not tell you about, so this seems a bad example.)
What if the function is written in assembler and takes some pointer to some opaque memory region?

void (*)(char *) and void (*)(void *) and even void (*)(struct SomeStruct *), where SomeStruct is declared but never defined, could all be correct and reasonable ways to declare that function in a C header.
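
For what it's worth, the "trampoline" mentioned upthread is tiny. A minimal sketch, with every name (`callback_fn`, `upcase_first`, `upcase_first_trampoline`, `run_callback`) invented for illustration: the typed function is never called through a mismatched function pointer type. A small wrapper with the exact `void (*)(void *)` signature does the pointer conversion instead.

```c
#include <ctype.h>

/* Hypothetical generic callback type, as a C API taking opaque data might declare it. */
typedef void (*callback_fn)(void *);

/* The function we actually want to run; it takes a typed pointer. */
void upcase_first(char *s)
{
    s[0] = (char)toupper((unsigned char)s[0]);
}

/* Trampoline: has exactly the callback signature and converts the
 * argument, so the call through callback_fn uses a compatible type. */
void upcase_first_trampoline(void *p)
{
    upcase_first((char *)p);
}

/* Stand-in for a library that invokes the callback with opaque data. */
void run_callback(callback_fn fn, void *arg)
{
    fn(arg);
}
```

Passing `(callback_fn)upcase_first` directly would compile with a cast and work on every mainstream ABI, but calling through the incompatible type is what the standard leaves undefined.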

For signed overflow it's fascinating, Herb Sutter (WG21 convenor, Microsoft employee) writes that checking would "incur unacceptable costs".

Now, C++ programmers, were you consulted about these costs? Herb insists they're "unacceptable" but he provides no further information as to what the cost actually was, or who decided whether that cost was acceptable, much less how this could generalize across a wide variety of domains and platforms.

What's Herb's answer? You might hope that Herb would say OK, we'll provide wrapping for these types by default, it's not checked arithmetic but at least it's not UB. Nope.

C & C++ lack a particular sort of behavior that could solve a bunch of problems.

They've got undefined behavior: The standard imposes no requirements. Anything can happen. Assume the worst.

They've got unspecified behavior: Only covers behavior caused by an unspecified value or where the standard provides multiple choices, where the implementation need not document which choices are made in which situations.

They've got implementation-defined behavior: Unspecified behavior that the implementation must document.

They don't have a category for "Undefined behavior but the implementation must document". A lot of what is currently undefined behavior could better be put into this category, if it existed.

C++ 26 will (almost certainly, it's in the draft) add Erroneous Behaviour.

EB is well defined but definitely wrong, so it's a way for the standard to say:

Do not allow this to happen, but if you do, the consequence is definitely that.

The specific EB in the C++ 26 draft is the value of default uninitialized primitives. So e.g. int k; std::cout << k << "\n";

In C++ 23 and previous versions that's Undefined Behaviour, maybe it prints the lyrics to the National Anthem of the country where the compiler ran? Maybe it deletes all your files. But in C++ 26 the Erroneous Behaviour is that there's some integer value k, which your compiler vendor knew (and might tell you or even let you change it) and it prints that value, but you're naughty because this is definitively an error when it happens.

With GCC/Clang you can just add checking with -fsanitize=signed-integer-overflow -fsanitize-undefined-trap-on-error.

For my main software project, some numerical software for magnetic resonance imaging, this adds 12212 checks, and the optimizer reduces them to 3803. I haven't done any benchmarking yet, but I would guess that for most software it would not matter.
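
Conceptually, each of those checks amounts to something like the sketch below. This is hand-written for illustration, not actual sanitizer output, and `mul_or_trap` is a made-up name. It assumes GCC or Clang, since `__builtin_mul_overflow` and `__builtin_trap` are compiler extensions.

```c
/* Roughly what a trapping overflow check on one multiplication looks like.
 * Assumes GCC or Clang: both builtins are extensions, not standard C. */
int mul_or_trap(int a, int b)
{
    int result;
    if (__builtin_mul_overflow(a, b, &result)) {
        /* Overflow detected: abort the program instead of continuing. */
        __builtin_trap();
    }
    return result;
}
```

When the optimizer can prove that the operands stay in range, the branch disappears, which is presumably how the check count drops.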

Basically the behavior is hardware-dependent, and nobody wants to mandate that C++ compilers generate a ton of extra instructions on hardware which does not behave a particular way.

Of course you can define your own checked integer types, using inline assembly to check the overflow flag where available.
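
A minimal sketch of such a type, using the GCC/Clang `__builtin_add_overflow` extension in place of inline assembly (on x86-64 it usually compiles down to an add followed by a check of the overflow flag anyway). The `checked_i32` type and its functions are hypothetical names, not an existing library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checked integer: carries a sticky overflow flag along
 * with the (wrapped) value. */
typedef struct {
    int32_t value;
    bool overflowed;
} checked_i32;

checked_i32 checked_i32_from(int32_t v)
{
    checked_i32 c = { v, false };
    return c;
}

/* Assumes GCC or Clang: __builtin_add_overflow stores the wrapped result
 * and returns true if the mathematical result did not fit. */
checked_i32 checked_i32_add(checked_i32 a, checked_i32 b)
{
    checked_i32 out;
    bool wrapped = __builtin_add_overflow(a.value, b.value, &out.value);
    /* Once any step overflows, the flag stays set for all later results. */
    out.overflowed = a.overflowed || b.overflowed || wrapped;
    return out;
}
```

The sticky flag means you can run a whole computation and test for overflow once at the end, instead of branching after every operation.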

Just so we're clear, yes, it's "hardware-dependent", but literally every single architecture and CPU model does the same reasonable thing, which is to wrap into the negative.

Any architecture that doesn't use 2s complement is so esoteric by now that it does not make any sense for a general-purpose C compiler to pretend they exist.