by i-use-nixos-btw 1172 days ago
This is written with quite a lot of hyperbole.

The predominant focus is realloc(pre,0) becoming UB instead of what the author misleadingly describes as useful, consistent behaviour. It is far from that, and that’s the entire reason that it was declared UB in the first place: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2464.pdf. Note that this wasn’t a proposal to change something, it’s a defect report: the original wording was never suitable.

The second part is the misconception about the impact of UB. Making something UB does not dictate that its usage will initiate the rise of zombie velociraptors. It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Note that this deviates from implementation-defined behaviour, because an implementation-defined behaviour has to be consistent. Where implementations choose to let realloc(ptr,0) summon the zombie raptors, they are free to do so. Don’t like it? Don’t target their implementation. Again, this isn’t a change from the POV of implementers - it’s a defect in the existing wording.

In this case, the course of action that any implementation will choose is to stick with the status quo. It is clearly not a deciding factor in whether or not you embrace the new standard, and to suggest otherwise is dishonest, sensationalist nonsense. The feature was broken, and it’s just being named as such.

6 comments

I agree that realloc was poorly defined for the zero-size case. I think either UB or IDB would have worked to really drive that point home; the WG chose UB.

That being said, you're completely wrong about what UB means. Making use of UB may as well initiate the rise of zombie velociraptors. Except for the situation where your implementation explicitly specifies that it provides a predictable behaviour for a specific case of UB, there's literally no guarantee of what will happen. Assuming that the implementation will stick with some status quo and your code won't exhibit absolutely unusual behaviour is just naive.

Please don't mislead people into thinking that it's ever a good idea to assume that undefined behaviour will be handled sensibly. This kind of misguided assumption is one of the major sources of bugs in C code.

> this kind of misguided assumption is one of the major sources of bugs in C code.

This is not even close to being true. Most bugs in C code are from programmer mistakes, not from UB. The exaggeration that is spread by some people regarding UB is close to absurd. If something is UB, it may generate different results in different situations, even with the same compiler. The standard is just clarifying this problem. A good compiler will do something sensible, or at least issue a warning when this situation is detected. If you have a bad compiler that does strange things with your code, that's a defect of the compiler, not of UB.

Optimizing compilers don’t work like that. They can either deviate from the standard and leave it as defined behavior, or mark it UB and go with it as usual.

To get some insight by analogy, consider this set of constraints (unrelated to C):

  x <= 7
  2x >= 5
  …(more with x, y, z but not more constraining x)…
When you feed this to a linear constraint solver, you may get anything from 2.5 to 7 as x. E.g. 3.1415926. Not because a solver wanted to draw some circles, but because it transformed your geometric problem into an abstract representation for its own algorithm, performed some fast operations over it and returned the result. Nobody knows how exactly a specific solving method will behave wrt (underconstrained) x given that the description above is all you have.

When you feed UB into an optimizer, you feed a bit of lava into a plastic pipe, figuratively. You’ll get anything from program #2500…0000 to program #6999…9999, where “…” is a few thousand or million more digits. Run some numbers from there as an .exe to see if something absurd happens.

The nature of UB and optimizers is that you either relax UBs into DBs and get worse efficiency, or you specify more UBs and get worse programming safety. What happens in between can be perceived as completely random. And the better/faster the optimizer is, the more random the outcome will likely be.

> The exaggeration that is spread by some people regarding UB is close to absurd

UB-in-code is absurd by definition, no exaggeration here.

> Most bugs in C code are from programmer mistakes

These most often lead to the triggering of UB. The reason why programmer mistakes lead to confusing bugs instead of simple and straightforward bugs which are easy to catch in the development process is mainly because UB imposes no restrictions on what the compiler should do. In the vast majority of UB cases the compilers simply don't do anything, and assume it can't happen. This is why dereferencing a pointer and then checking if it's null ends up eliding the null check (because if you've dereferenced it, it can't be null, that would be UB). Accessing past the end of an array is UB so it can't happen, therefore your compiler won't check for it. Accessing past the end of an array and accidentally reading from/writing to another variable - likewise.
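
A minimal sketch of the null-check pattern described above. The function names are made up for illustration, and whether a given compiler really drops the check depends on the compiler and the optimization level, so the comments describe what is permitted, not what is guaranteed:

```c
#include <stddef.h>

/* Risky order: the dereference happens before the check. Dereferencing a
 * null pointer is UB, so an optimizer is allowed to assume p != NULL from
 * that point on and treat the check below as dead code. */
int read_or_default_risky(const int *p, int fallback)
{
    int value = *p;        /* UB if p is NULL */
    if (p == NULL)         /* may be elided as "cannot happen" */
        return fallback;
    return value;
}

/* Safe order: check first, dereference only on the non-null path. */
int read_or_default(const int *p, int fallback)
{
    if (p == NULL)
        return fallback;
    return *p;
}
```

Both functions agree for valid pointers; only the second one has defined behaviour when handed NULL.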

UB encompasses ALL behavior for which the standard does not provide an explicit definition. The reason why the C standard provides explicit instances of UB usually boils down to clarifying situations where people were confused about whether something was UB or not. But if the behaviour is not defined in the standard, then it is by definition UB.

If I am not wrong, one major security bug that C programs usually face is buffer overflow, which is an undefined behavior.

Right, this should have been left to the implementor if they didn't want to standardize one behavior. Making it UB is the worst possible outcome. Yes, people who write portable code will still want to not rely on `realloc()`'s freeing behavior, but if you do and your realloc() implementation doesn't, then you suffer a leak, while if you do and realloc() decides to wipe your drive and make your power supply explode...

> Except for the situation where your implementation explicitly specifies that it provides a predictable behaviour for a specific case of UB, there's literally no guarantee of what will happen.

That situation is "when you have UBSan turned on".

> The second part is the misconception about the impact of UB. [...] It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Wrong, Wrong, Wrong.

UB allows the implementation to take any arbitrary course of action, without informing anyone, without documentation, without any conscious decision, without weighing anything to be better/worse. Nondeterministically catching fire and launching nuclear rockets is a completely compliant reaction to UB.

What you are describing is "implementation defined" behavior. That has to be deterministic, documented, and conforming to some definition of sanity. Examples are the binary representation of NULL, sizes of integer types or stuff like the maximum filename length. Sadly, too many things in C have "undefined behavior", too few have "implementation defined" behavior.
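
To make the contrast concrete, here is a small sketch of querying implementation-defined values. The helper names are hypothetical; the macros come from the standard headers. Whatever numbers come back, the implementation has to document them and they stay the same from run to run:

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Storage size of an int in bits on this implementation
 * (implementation-defined, but fixed and documented). */
size_t int_storage_bits(void)
{
    return sizeof(int) * CHAR_BIT;
}

/* Size of a buffer large enough for the longest file name the
 * implementation promises to be able to open. */
size_t filename_limit(void)
{
    return (size_t)FILENAME_MAX;
}
```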

And UB has always been an excuse for compilers to screw over programmers in hideous ways. Programmers are rightfully afraid of any kind of new UB being introduced, because it will mean that whole new classes of bugs will arise because the compiler optimized out that realloc(..., a) where a might be 0, because that's UB, so screw you and your code... And this change is especially dangerous because it makes a lot of existing code UB.

> And UB has always been an excuse for compilers to screw over programmers in hideous ways

Your reply was great up until this. Compiler writers aren’t looking to screw over programmers, they’re looking to make code faster. UB gives them the ability to make assumptions about what is and is not true, at a particular moment in time, in order to skip doing unnecessary work at runtime.

By assuming that code is always on the happy path, you can cut a lot of corners and skip checks that would otherwise greatly slow down the code. Furthermore, these benefits can cascade into more and more optimizations. Sometimes you can have these large, complicated functions and call graphs get optimized down to a handful of inlined instructions. Sometimes the speedup can be so dramatic that the entire application is unusable without it!

Many of these optimizations would be impossible if compilers were forced to assume the opposite: that UB will occur whenever possible.

The tool programmers have available to them is compiler flags. You can use flags to turn off these assumptions, at the cost of losing out on optimizations, if your code needs it and you’re unable to fix it. But it’s better to turn on all possible warnings and treat warnings as errors, rather than ignoring them, to push yourself to fix the code.

The thing that makes UB almost malicious is that it propagates inter-procedurally. This makes reasoning about code with UB basically impossible, which means you should always assume that the compiler is going to screw you over if you use it, because there is no way to know whether it will.

You should consider a program with undefined behaviour to be the equivalent of a mathematical proof that contains an unstated contradiction. Ex falso quodlibet: from a falsehood anything follows. Also called the principle of explosion.

Undefined behaviour renders your entire program meaningless. It must be avoided at all costs. Using undefined behaviour on purpose is like sticking a fork in an electrical socket.

> Undefined behaviour renders your entire program meaningless

That's exactly the complaint. Consider that the implementations of the standard library sometimes have exposed UB: that renders behaviour of all of the running code on the system undefined.

Many programmers believe that the fallout of the UB could, and therefore should, be limited in scope.

To achieve your goal, compilers would have to disable any sufficiently powerful optimization. If you write bugs (UB), a powerful compiler will eventually catch them and generate code that you didn't intend at the beginning. However, this is not the fault of the compiler or the language.

It's funny that your original post was an objection to how undefined behavior gives license to screw developers over, but here you are talking about how undefined behavior is like sticking a fork in an electrical socket.

My original post was an objection to the implied intent on the part of compiler writers. An electrical socket does not have intent, it's just a hazard that also happens to provide enormous benefits to our lifestyles.

I think it's a perfect analogy to undefined behaviour in C: enormous benefits but also a hazard to be wary of. A lot of people don't understand the benefits, they just see the hazard. Throughout this discussion I've been trying to clarify that, with perhaps limited success.

But just to be clear @chongli is logical

Think of UB as a probabilistic error, i.e. it is always stupid to rely on it.

1. Write code without errors -- sensible
2. Allow compilers to assume the absence of errors -- occasionally sensible, since it speeds up your program

In defence of UB, for the most part they are things that should break your program anyway: stack overflow is never correct. So your choice is mostly to fail badly quickly, or to fail slowly well.

Thanks to Google making the UB sanitizers, you are free to make that choice even in C.

That's not an argument to keep live grenades lying around, it's an argument to remove them from the spec.

Like signed int overflow being UB. Define it to have two's complement semantics. Problem solved. I'm sure the nutters trying to extend C++ with templates will howl but this is C not C++. And seriously C++ is dead man walking at this point.

C23 does make two’s complement standard. It also adds checked arithmetic so you can safely avoid signed overflow.

It does not make signed overflow defined behaviour. This would prevent integer operation reordering as an optimization, leading to slower code.

Until LLVM, GCC, key game engines and GPGPU SDK get rewritten into something else, it is going to be Resident Evil day for a looong time.
I wish UB were only as nasty as "nondeterministic behavior". In fact, if there's UB in anything the compiler sees, nothing at all can be assumed, including whether you even get an output. What you've given the compiler isn't C, so it doesn't have any obligations to do anything with it. The codepath with UB doesn't have to run for the nuclear rockets to launch and the nasal demons to appear.

Since approximately every nontrivial program ever written has UB, in actual practice we're only saved by the fact that compilers aren't entirely maliciously compliant.

That's not true. If the program's execution path from start to finish avoids UB then you're safe. (Also the source code itself has to avoid UB, but that part isn't hard.)

It's true that code with UB does not have to be reached, per se, but it does have to be something your program will reach before it can hurt you.

You're correct in practical terms, but I'm making a very pedantic point about what the standard requires to happen, mainly because this pedantry has important implications for e.g. safety critical C. Note 1 to the definition in 3.4.3 provides some clarification about the extent of UB and states that UB can manifest at translation time. It also says that the translator should behave in a documented manner when encountering UB, but does not require that it do so.

C has both translation-time UB and runtime UB. (C++ explicitly separates the two concepts into "ill-formed, no diagnostic required" and "undefined behavior".) You can tell them apart from the condition for UB to occur: if it's a translation-time condition, then it's translation-time UB, and if it's a runtime condition, then it's runtime UB. (Same with implicit UB: is it a translation-time or a runtime assumption being violated?)

Usually when we talk about UB, we're implicitly talking about runtime UB, since translation-time UB is generally far less subtle. If a program contains only conditional runtime UB, the compiler is not permitted to break the entire program from the very beginning, since all possible executions that do not trigger runtime UB must execute correctly as per 5.1.2.3.
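
As an illustration of "conditional runtime UB", consider integer division. This is only a sketch with made-up function names: the first function is well defined for most inputs and has UB only for particular runtime values, and the second shows the guard that keeps every execution on the defined side:

```c
#include <limits.h>
#include <stdbool.h>

/* UB only for some runtime inputs: b == 0, or a == INT_MIN with b == -1
 * (the quotient is not representable in int). Every other call is defined. */
int divide_unchecked(int a, int b)
{
    return a / b;
}

/* Guarded variant: returns false instead of performing an undefined
 * division, and leaves *out untouched in that case. */
bool divide_checked(int a, int b, int *out)
{
    if (b == 0 || (a == INT_MIN && b == -1))
        return false;
    *out = a / b;
    return true;
}
```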

5.1.2.3 only binds conforming programs. Programs containing UB are by definition non-conforming.

I hadn't considered the C++ standard here, but 1.9 is much more clear than corresponding C verbiage. 1.9.5 is exactly what's described upthread, where any "execution [that] contains an undefined operation" has no prescribed behavior. But the note to the requirement immediately before that (1.9.4) doesn't use that language and instead "imposes no requirements on programs that contain UB". If they had intended only to avoid specifying semantics for programs that hit UB during some possible execution, they would have used the same language as 1.9.5.

Fine. HN is, after all, a place where you can be pedantic.

But those of us who are actually writing programs mostly care about "in practical terms", and in practical terms, this doesn't happen, so we don't care. We've got enough trouble worrying about what does happen; we don't have time and energy to worry about what doesn't and won't happen.

To provide some more context/motivation for why you might care, I write safety-critical code. I'm often advising people what they need to do for certification, etc. If all you need to do is ensure that you never execute undefined operations and knock out the list of specified UB, that's totally, 100% manageable. Throw some sanitizers on, provide realistic input, and test the hell out of it. Normal stuff.

If the reality is that any UB can invalidate the entire program (as is the interpretation taken by other standards re: C), then that's not remotely sufficient. You have to ensure the complete absence of UB.

That's like saying: "I don't care what the standard says!"

Sure, this is perfectly fine.

Only then you're not writing C/C++, but something in the "gcc 12 language with some switches", or maybe the "LLVM 15 language with some switches", or something like that.

> approximately every nontrivial program ever written has UB

You can replace "UB" with "bugs" and the result is the same. UB is a bug on the part of the programmer, from the point of view of C, similar to dereferencing a null pointer. When the standard says that something is UB, it is just clarifying what these situations are.

What the standard explicitly calls out as UB is only a small subset of actual UB.

While you can certainly classify all UB as "bugs", doing so misses the critical differences between UB and other categories of bugs. If you have a logic bug for example, your program will correctly and consistently do the wrong thing. It will continue doing that wrong thing with a different compiler, on a different platform today and 10 years from now. Implementation defined behavior is a bit looser, but will still be consistent with any particular implementation (which will document the behavior) and will only manifest in the code that depends on it. A PR inserting one of these "normal" bugs doesn't invalidate the entire rest of the program.

UB is different. You can't make assumptions about UB because from the point of view of the standard, UB is "not C". There are no assumptions to be made, it's just all the stuff that doesn't have assigned semantics. And since the input is meaningless, so is the entirety of whatever the compiler gives you back.

> If you have a logic bug for example, your program will correctly and consistently do the wrong thing.

Not correct. Bugs can occur differently on different architectures, even in high level languages. UB is just a kind of bug whose effect depends on how the compiler behaves, so you have to be careful to test your code on different compiler settings. This is nothing new in programming languages, it is only made explicit in the C standard. Suddenly people started to believe that pointing out the obvious source of bugs (UB) in the standard is equivalent to letting programs misbehave.

I'm not sure if you're making a point about "unspecified behavior" (where the compiler can choose between multiple valid behaviors), but no, a strictly conforming program will have the same semantics on different architectures. Strictly conforming programs can still have bugs, but their nature is completely different than UB because that's the point of the standard.
> you have to be careful to test your code on different compiler settings.

The problem is you have to test your code on compilers that don't exist yet with compiler settings that do different things from any compiler that ever might exist.

Bugs are UB-like in a sense (what's the code going to do? well, you'll have to think about it, or try it and see), but UB is strictly worse than bugs (different compilers, even different versions of the same compiler, can do radically different things way beyond the scope of the bug).
That's exactly why a compiler shouldn't be able to 'optimize' in the face of UB, it should be an ERROR and the section of undefined behavior highlighted in the error message.
This would mean you’d have to insert a check every time you add two signed integers together, because signed overflow is UB. You’d also have to wrap every memory access with bounds checks, because OOB memory access is UB.

There are also tons and tons of loop optimizations compilers do for side-effect free loops which would have to be removed completely. This is because infinite loops without side effects are UB. So if you wanted these optimizations you’d have to prove to the compiler — at compile time — that your loop is guaranteed to terminate since it is not allowed to assume that it will. Without these loop optimizations, numerical C code (such as numpy) would be back in the stone ages of performance.

Edit: I just wanted to point out that one of the new features in C23 is a standard library header called <stdckdint.h> that includes functions for checked integer arithmetic. This allows you to safely write code for adding, subtracting, and multiplying two unknown signed integers and getting an error code which indicates success or failure. This will be the standard preferred way of doing overflow-safe math.
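
A hedged sketch of what using the checked arithmetic looks like. `ckd_add` is the C23 macro from `<stdckdint.h>`; the wrapper name `add_int_checked` is made up here, and the fallback to `__builtin_add_overflow` is an assumption that only holds on GCC/Clang-style toolchains that lack the header:

```c
#include <limits.h>
#include <stdbool.h>

#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>   /* C23 checked-arithmetic macros */
#  endif
#endif

#ifndef ckd_add
/* Assumption: without <stdckdint.h>, use the GCC/Clang builtin, which has
 * the same contract (returns true when the result does not fit). */
#  define ckd_add(result, a, b) __builtin_add_overflow((a), (b), (result))
#endif

/* Adds a and b. On success stores the sum and returns true; on overflow
 * returns false and leaves *sum untouched. No signed overflow is executed. */
bool add_int_checked(int a, int b, int *sum)
{
    int tmp;
    if (ckd_add(&tmp, a, b))
        return false;
    *sum = tmp;
    return true;
}
```

The caller has to look at the return value, which is the point: the failure case becomes an ordinary branch instead of undefined behaviour.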

Another option would be to define behaviors for integer overflow and out of bounds memory access. Presumably they happen fairly often and it might be a good idea to nail down what should happen in those cases.
> you’d have to insert a check every time you add two signed integers together,

This is exactly what is done in serious code. It is typically combined with contracts and static analysis (often human), e.g. "it is guaranteed that this input is in range 10-20, so adding it with this other 16 bit int can be assumed to be below sint32_max".
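
A rough sketch of that contract style, using the 10-20 range from the example above. The function name is hypothetical, and `assert` stands in for whatever contract or static-analysis annotation a real project would use:

```c
#include <assert.h>
#include <stdint.h>

/* Contract: 10 <= small <= 20. Given that, the sum with any int16_t lies
 * in [10 - 32768, 20 + 32767], which always fits in int32_t, so the
 * addition below cannot overflow and needs no further check. */
int32_t add_bounded(int32_t small, int16_t other)
{
    assert(small >= 10 && small <= 20);   /* enforce the stated range */
    return small + other;
}
```

The range check happens once at the boundary; the arithmetic behind it is then provably inside the defined range.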

> because signed overflow is UB

no longer

Doing that at compile time would require being able to perfectly predict everything the program can do, which is equivalent to solving the halting problem (make the program do something undefined after it finishes, then if you get an error at compile time then it halts) and is mathematically impossible. Doing it at runtime would have a massive performance impact.

We rehash this argument every few weeks. Please search the comment history for why it is nonsensical.

If they are bugs they should be reported to the user and end the compilation with an error.

Compilers actually have some options to enable that.

The problem is, it only works well in the simplest cases when the code will 100% exhibit UB within a single function.

In most cases, the UB would only manifest on particular input values - if you want your compiler to warn about that then it will report one "potential UB" for every 10 lines of C code, and nobody wants to use such a compiler.

The case of realloc being declared UB (as opposed to impl-defined) was not driven by the compiler writers but by the people who write the C libraries.

This isn't a case of compilers screwing over the programmers, because the people who are responsible for those optimizations are the people who are scratching their heads as to why it's UB and not impl-defined behavior.

UB can initiate the rise of zombie velociraptors.

  int n;
  printf("type 0 to stop the rise of zombie velociraptors");
  scanf("%d", &n);
  realloc(pre, n);  /* UB when n == 0 */
  if (n != 0) rise_zombie_velociraptors();

May result in velociraptors rising even if the user enters "0".

The reason is that because realloc(pre, 0) is UB, for the compiler, it cannot happen, so n can't be 0, so the n != 0 test can be optimized out, so, velociraptors.

> The second part is the misconception about the impact of UB. Making something UB does not dictate that its usage will initiate the rise of zombie velociraptors. It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Wrong. UB never happens. That is the promise the program writer makes to the compiler. UB never happens. A correct C program never executes UB. This allows the compiler to assume that anything that is UB never happens. Does some branch of your program unconditionally execute realloc(..., 0) after constant propagation? That branch never happens and can just be deleted.

Reading the defect report, they state "Classifying a call to realloc with a size of 0 as undefined behavior would allow POSIX to define the otherwise undefined behavior however they please." which is wrong. UB cannot be defined, if you define it, you are no longer writing standard C. It should instead have been classified as "implementation-defined behaviour".

In any case it's not that hard to just write a sane wrapper. This one is placed in the Public Domain:

    #include <stdlib.h>

    /* realloc wrapper with one fixed behaviour for the corner cases */
    void *sane_realloc(void *ptr, size_t sz)
    {
        if (sz == 0) {
            free(ptr); /* free(NULL) is a no-op */
            return NULL;
        }
        if (ptr == NULL) {
            return malloc(sz);
        }
        return realloc(ptr, sz);
    }

I am calling it sane and not safe, because it is not safe. You still have the confusion of what happens when the function returns NULL (was it an allocation failure, or did we free the object?) - check errno. However, it has the same fully defined semantics on almost all implementations and acts the way people would expect.

You may be tempted to make the function return the value of errno, mark it [[nodiscard]] and take a pointer-to-pointer-to-void, so that the value of the pointer will only be changed if the reallocation was successful. I am not sure if that is safer. You are trading one possible bug (a null pointer on allocation failure, which will then cause a segmentation fault) for another (a stale pointer on allocation failure, but with an updated size). The latter is more likely to be used in buffer overflow attacks than the former.
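
For reference, a sketch of what that pointer-to-pointer variant might look like. The name `realloc_in_place` is made up, the `[[nodiscard]]` attribute is left out so the code also builds before C23, and returning `ENOMEM` assumes a POSIX-style `<errno.h>` (ISO C itself does not require that macro):

```c
#include <errno.h>
#include <stdlib.h>

/* Resizes *pp to sz bytes. Returns 0 on success and an errno-style code on
 * failure. *pp is only modified when the call succeeds, so on failure the
 * old block stays valid and the caller must not start using the new size. */
int realloc_in_place(void **pp, size_t sz)
{
    if (sz == 0) {
        free(*pp);               /* free(NULL) is a no-op */
        *pp = NULL;
        return 0;
    }
    void *fresh = realloc(*pp, sz);   /* realloc(NULL, sz) acts like malloc(sz) */
    if (fresh == NULL)
        return ENOMEM;           /* assumption: POSIX error code */
    *pp = fresh;
    return 0;
}
```

The stale-pointer risk is visible in the signature: nothing forces the caller to check the return value before updating its own idea of the buffer size.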

> This is written with quite a lot of hyperbole

The first sight of "catch fire" might not have caught my attention, but by the time it got to "instrument of arson" and "Molotov cocktails", the style was sufficiently distracting that I was convinced I wasn't the intended audience.

My understanding was that they're changing realloc() because they previously allowed zero length arrays and because you can't tell if this is a zero length array you need to either get rid of zero length arrays or change realloc().

So the feature wasn't broken to begin with, it was broken by another feature.