Hacker News new | ask | show | jobs
by greysphere 33 days ago
The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. Which if you are going to be that generous, every function call is UB because it could exceed stack space. Which is basically true in any language (up to the equivalent def of UB in that language). I feel like c has enough actual rough edges that deserve attention that sensationalism like this muddies folks attention (particularly novices) and can end up doing more harm than good.
5 comments

Ada 83 has no UB on call stack overflow, from the reference manual :

http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html

"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."

So it's just as useful as when your stack area ends with a page that will segfault on access, or your CPU will raise an interrupt if stack pointer goes beyond a particular address?

It's not safe though because throwing an exception, panicking, etc, is still a denial of service. It's just more deterministic than silently overwriting the heap instead. If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.

You're mixing specification (a language reference manual) and implementation (a given compiler, target, options, ...).

The Ada language specification says the Ada programmer can expect any Ada compiler when used in fully compliant mode to properly raise STORAGE_ERROR when a stack overflow occurs.

Only the Ada compiler writer has to deal with this, not every single programmer on every single program and platform (the UB behaviour of some languages).

In the case of GCC/GNAT the compiler manual provides insight on how to be in compliant mode per target regarding stack overflow, what are the limitations if any. You have tools to monitor and analyze you Ada code in this respect too.

Deterministic, well-defined behavior is inherently safer than undefined behavior. It allows you to diagnose the problem and fix it. UB emphatically does not, and I don't dare to think of how many millions of person-hours are wasted every year dealing with the results.
A segfault is considered safe if you're talking about functional safety because it results in a return to a defined safe state (RTDSS).

If a segfault leads to some other state you do not deem "safe", such as a single program gating access to a valuable asset with a default fail state of "allow", you just have a fundamental design flaw in your system. The safety problem is you or your AI agent, not the segfault.

A reliable 'denial-of-service' is better than silently doing the wrong thing.

> If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.

Only for specific implementations; and neither C nor C++ standard guarantee that your program won't overflow the stack anyway. (Have a look, the standard says nothing about stack overflows, they still occur in practice with supposedly compliant compilers. So it seems that there are no rules about when they can happen.)

That's not true at all.

First, you can define what happens when stack space is exceeded. Second not all programs need an arbitrary amount of stack space, some only need a constant amount that can be calculated ahead of time. (And some languages don't use a stack at all in their implementations.)

Your language could also offer tools to probe how much stack space you have left, and make guarantees based on that. Or they could let you install some handlers for what to do when you run out of stack space.

The examples are unequivocally UB. Full stop.

How to think of this properly is that when you have UB, you are no longer under the auspices of a language standard. Things may work fine for a time, indefinitely even. But what happens instead is you unknowingly become subject to whimsies of your toolchain (swap/upgrade compilers), architecture, or runtime (libc version differences).

You end up building a foundation on quicksand. That's the danger of UB.

> The examples are unequivocally UB. Full stop.

Tbh, already the first example (unaligned pointer access) is bogus and the C standard should be fixed (in the end the list of UB in the C standard is entirely "made up" and should be adapted to modern hardware, a lot of UB was important 30 years ago to allow optimizations on ancient CPUs, but a lot of those hardware restrictions are long gone).

In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not. On most modern CPUs unaligned load/stores are no problem at all (not even a performance penalty unless you straddle a cache line). There's no point in restricting the entire C standard because of the behaviour of a few esoteric CPUs that are stuck in the past.

PS: we also need to stop with the "what if there is a CPU that..." discussions. The C standard should follow the current hardware, and not care about 40 year old CPUs or theoretical future CPU architectures. If esoteric CPUs need to be supported, compilers can do that with non-standard extensions.

Not having unaligned access in the language allows the compiler to assume that, for basic types where the aligment is at least the size, if two addresses are different then they don't alias and writes to one can't change the result of reads from the other. That's a very useful assumption to be able to make for optimization - much more useful than yolocasting pointers in a way that could get you unaligned ones.
> if two addresses are different ...

Eh, if the compiler knows that two addresses are different at compile time, it also knows how big the difference is.

Usually this is not the case.
Indeed one of the fun LLVM bugs is that it can arrive at a situation in which it believes pointer A and pointer B are definitely not equal (weird given what's about to happen but OK that's potentially fine...) then we ask for their addresses† as integers X and Y, LLVM insists those integers aren't equal either because the pointers weren't (which as we're about to see is wrong) and then we subtract X - Y or Y - X and the answer either way is zero. Awkward. The integers were definitely equal.

† Although on a real modern CPU the pointer "is" just an address, notionally it has three components, the address, an address space (modern machines typically only have one) and a "provenance".

I agree. I meant to elaborate more on how to think of UB.

For most C software on x86_64, UB is "fine" with very strong bunny ears. But it is preferable for one to, shall we say, write UB intentionally rather than accidentally and unknowingly. Having an awareness of all the minefields lends for more respect for the dangers of C code, it makes one question literally everything, and that would hopefully result in more correct code, more often.

On that note, on some RISC-V cores unaligned access can turn a single load into hundreds of instructions.

I think the problem is just that C is under specified for what we expect a language to provide in the modern age. It is still a great language, but the edges are sharp.

How is UB fine? It can make your programme rather exploitable, depending on compiler. And the compiler might change her mind tomorrow.
Undefined means that the ISO C doesn't define the behavior. An implementation is free to do so.
If they do, that is no longer an implementation of C. It is a dialect of C, and there are many (GNU C being the most popular), but there are real drawbacks to using dialects.

This is in contrast to the other category that exists, which is "implementation-defined".

The thing is that the actual compiler behaviour matters more for real-world projects than what the C standard says. E.g. the C standard was always retroactive, it merely tried to reign in wildly different compiler behaviour at the time when the standard was new. It mostly succeeded, but still the most useful C and C++ compiler features are living in non-standard extensions.
Unaligned access being fine in one architecture, but not in others would create separate dialects, regardless of being blessed by ISO C.

Just don't do unaligned access, it's a dialect that doesn't exist currently, and should never exist.

> If they do, that is no longer an implementation of C.

This is plain wrong. Undefined behaviour, means the C standard specifies no restriction on the behaviour of the program, which is what the implementation chooses to emit. An implementation can very well choose to emit any program it pleases, including programs that encrypt your harddisk, but also programs that stick to well defined rules.

Sure, but the point is that code written against such a compiler is not C and is not portable. It is written in a dialect of C, and that comes with drawbacks.

Writing C (or any language) means adhering to the standard, because that's the definition of the language.

There are still modern CPUs that don't support misaligned access. It would be insane for C to mandate that misaligned accesses are supported.

However I do agree that just saying "the behaviour is undefined" is an unhelpful cop-out. They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.

> In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not.

Not just the CPU - memory decides as well. MMIO devices often don't support misaligned accesses.

> They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.

That means that the compiler must emit the read, even if the value is already known or never used, as it might trap. There is a reason for the UB!

No it doesn't. Compilers are only required to emit the read for volatile types. If the type is non-volatile, misaligned, and can be optimised out then it would be perfectly fine to omit it (that would be the "succeed" option).
If a trap is observable behaviour, then the compiler either needs to add code, that checks for the condition and then traps explicitly or it needs to actually perform the read. Currently it can be optimized out, because it is UB.
On hardware that doesn't support it, misaligned loads could be compiled to multiple loads and shifts. Probably not great for performance, and it doesn't work if you need it to be atomic, but it isn't impossible.
That is only really possible if you know the pointer is misaligned at compile time (which does happen, e.g. for packed structs). The examples in the article are for runtime misalignment. It would be crazy to generate code so that every function checked if every access was aligned at runtime.

(Note the normal way to handle that if the hardware doesn't actually support it is for the access to trap and then the OS or firmware emulates it.)

That still requires detecting when a misaligned load happens.
For x86 SSE there are aligned instructions that will trap on unaligned access.
The first example is dereferencing an integer pointer. That is a valid operation. Now if that pointer isn't valid (and being unaligned is one of many reasons it could be invalid) then calling the function with that invalid pointer will be UB.

An honest discussion would be something more like 'dereferencing pointers can lead to UB on invalid pointers. Here are N examples of that. Maybe avoid using pointers. Maybe consider how other languages avoid pointers. Maybe these shouldn't be UB and instead some other class of error.' And then even more honest discussion would present the upsides of having pointers and the upsides of having these errors be UB.

Instead, the article (and your comment) take this valid operation and presents it as invalid. Imagine you're a new programmer, you are just starting to wrap your head around pointers and you stumble across this article. You see the first example and it looks exactly what you would expect a dereference to look like. But the article claims it's wrong, and now you're confused. So you dig into the article more closely and are exposed to all these terms like UB, alignment, type coercion etc and come away more confused and scared and disinclined to understand pointers. This is classic FUD. This is a technique to manipulate, not educate.

Pointers have pros and cons. UB has pros and cons. Let's try to educate people about them.

There is an important distinction here to the technical meaning of UB that is lost to many.

UB simply means the operation you are intending to perform has no defined semantic under the ISO C specification. That is all. Understand what this means but do not read further into it. It is easy to read further into this as you have and many do, and come to incorrect conclusions, and think this MUST result in incorrect behaviour, but this is not the claim. The claim is rather than once you write UB, you are no longer writing C the language with a defined spec, and that any manner of degrees of freedom (architecture, toolchain, etc) can now cause your code that was once behaving correctly to now behave incorrectly. That is the danger.

> That is a valid operation. Now if that pointer isn't valid (and being unaligned is one of many reasons it could be invalid) then calling the function with that invalid pointer will be UB.

This is incorrect. The moment you express this in source code, it is already UB wrt to the C abstract machine.

6.3.2.3. 755 If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.

https://c0x.shape-of-code.com/6.3.2.3.html

The important distinction is to KNOW this is still UB; whether the operation yields the expected behaviour on your platform and architecture is completely a separate question.

The reason this is of utmost important is because the C compiler operates on the C abstract machine.

If you violate language invariants, the compiler can--keyword can--emit WRONG code and it will be CORRECT to do so because C unfortunately allows it to. When this happens it's silent and deadly and it's a pain to debug. The point of all this seeming language lawyering is not FUD, it is genuine frustration with these footguns of the language that we are trying to share with others. Understanding UB correctly really is what separates those that know C and those that "know" C.

Things will work and then they won't. This can be fine for most cases but not fine for others. If you use C in 2026 you need to understand this.

> come away more confused and scared

This is the correct take. One aught to be more confused and scared after learning about UB; the language simply leaves things under-specified and it is up to the developer to understand they are engaging in UB.

Once UB is acknowledged, one aught to impress upon themselves the software they build is dependent ever more on the whims of their particular compiler (clang/gcc), compiler flags (optimizations), architecture, and runtime environment.

Maybe I'm misunderstanding. Here is what I'm trying to say.

"Accessing an object which is not correctly aligned" - this is UB

"As an example of this, take this code: ..." - this (code) is not UB.

Is this incorrect somehow?

You could interpret the second sentence as 'under the assumption of an unaligned pointer, let's look at what this seemingly innocuous (and correct) code does.'

But that's not what they did. They presented that code as if it's incorrect (following the whole premise of the article 'Everything in c is UB'). That's what the whole article does, they take a topic with real concerns, then present 'normal' code, and then imply the code is the issue (and therefore the language), not the premise.

You know what would be better, show an example that clearly shows the complete path for the premise to the issue. Ie show some code that generates an unaligned pointer and then uses it. Why did the author not do that? Surprise, because it's actually pretty hard to write code that's 'guaranteed' unaligned behavior.

    int foo[10];
    int *bar = (int *)(((int)&foo) + 1);
Is this unaligned access? You don't know because you don't know the size of int. (Not to mention it looks ridiculous. By only showing 'reasonable' code as the example, the article suppresses the common 'uh just don't do that' criticism.)

And in fact the ambiguity of alignments and sizes is the whole point - they are given the privilege/footgun of being undefined in c so that compilers are easier to write. It's very debatable if this was/is a good idea, but that's where the debate should be, not illusorily ascribed to derefing pointers.

If I'm misunderstanding, please let me know. Specifically, if you're claiming (1) either the literal code in the first box of the article is UB, or (2) please write some literal code that is UB in the vein of the first claim of the article. I think that would help me bridge the gap that we seem to be having.

Edit: I think one part of the confusion is we were addressing different parts of the first example of the article. You were referencing the int foo(..) snippet (which I agree has no UB), but I was referencing the parse_packet() snippet (which has UB by construction), which was also part of the first example :).

You are beginning to understand. Yes, surprisingly, it is (1) that is being claimed.

The mere expression is alone UB. Yes, you read that right. In source code, it's already UB. Why? Because the ISO spec defined UB that way. But you see, what this means in practice ie whether "it works" is an entirely separate question and would be specific to toolchain, hardware, runtime, the alignment of the pointer in question, blah blah.

There is nuance here, and that's why this topic is debated to death, because it's hard to explain and it is genuinely complex.

When people say something is UB, they mean to say that the behaviour is undefined--wrt to ISO C.

The behaviour that actually matters IS defined wrt toolchain, hardware, runtime, alignment of pointer in question.

But that's exactly it--the latter is not what we mean when we say something is UB, when we say something is UB we are talking about the ISO C spec. The important follow up question then, when knowingly invoking UB, is to ensure your environment is "correct", because you have now crossed into realms entirely out of the auspices of the ISO C spec. Ergo, you are now in UB land; what you thought was the foundation of your codebase, the ISO C spec, has now turned into quicksand.

It is this implied undocumented dependence on factors external to the source code that is a huge source of bugs and surprisal.

So take this example from the article. Yes, it is UB by construction.

(edit: i copied the wrong fragment initially -- if you were talking about the int foo(const int* p) fragment, yes that block is not by construction UB)

    bool parse_packet(const uint8_t\* bytes) {
            const int\* magic_intp = (const int*)bytes;   // UB!
            int magic_raw = foo(magic_intp);  // Probably crashes on SPARC.
            int magic = ntohl(magic_raw); // this is fine, at least.
            […]
    }
Why?

> Because the compiler is not obligated to generate assembly instructions that work on unaligned pointers. Because it’s UB.

Does it actually work though? It might and it might not: there is simply no guarantee from the language. But that's all it says. It may very well work on your arch and platform and toolchain, indefinitely. But again circling back, for code written like this to be so brittle, that is why UB is to be avoided.

And to your point:

> that's where the debate should be, not illusorily ascribed to derefing pointers.

But that is where the debate is. People just do not understand what UB actually means. The article is correct: everything in C is UB. The takeaway is not that, therefore all C code is irredeemably broken (well, to some people it does mean that, anywho..). The takeaway is that most C code IS in fact more delicate than one may originally believe, because of the fact ISO C is under-specified, to allow for specialization dependent on toolchain/arch/hardware/what have you etc.

So it is incumbent on the developer when writing C to correctly acknowledge when they are invoking UB, and to do so intentionally with the awareness that things may just randomly break one day.

Thanks, yes I think we had some confusion on the foo() vs parse() and I was referring to foo().

But even for the parse() example, the issue is the aliasing rules (and not the alignment - though that could still be an issue depending on input!) Aliasing isn't even mentioned in the article. Instead the example presents this thing 'people do all the time' and identifies it 'UB' without even identifying the actual issue.

On its own I could forgive the former, making a precise example is tricky (particularly with alignment issues). But this is repeated: milliseconds() is characterized 'UB' because it's inputs could be outside the representable range. Again the function is not UB, the inputs can potentially trigger UB.

Then the function pointer example obfuscates the assignment (fine) with the call (ub). Despite the red herring statement 'NULL compares unequal to any object or function' as the example assigns a function _pointer_ which can be NULL. The honest example of the statement is:

    void foo() = NULL;
which won't even compile because it violates the thing that was just said (among other reasons). The UB is the call below and has nothing to do with equality with NULL.

The repeated pattern of say one thing, show an example that's 'reasonable' and implies that it's related to what was just said, and from that invalid relation conclude that everything is UB in C feels dishonest (particularly when it's so easy to talk about UB in C honestly because there is so much to be legitimately concerned about!)

I appreciate your and your advocacy about UB in C, and I think I agree with most of your points about it and they worth discussing. That's why the article itself is frustrating to me, we don't need to be tricksy when talking about UB, it's already tricky enough!

UB based on input can be an exploit vector.
Unvalidated input can always be an exploit vector.
Except in C, validation of user input can in itself be an exploit vector.
That’s true in other languages as well. Any programmatic task can end up being an exploit vector.
No? That's the whole point of formal verification?

You can even kind of retrofit this to C. The classic example is "sel4". You just need a set of proofs that the code doesn't trigger UB. This ends up being much larger and more complicated than the C itself.

You can fail to verify something which you actually wanted to verify (i.e you made a proof of something else instead of the thing that mattered). See WPA2 KRACK as an example.
Yeah, but only in C* can those errors end up as more UB.

* terms and limits may apply.

Many other languages also have UB. But almost none is as trigger happy as C and C++ are.
Turtles all the way down.
Yes, this article is pretty much the definition of FUD.