Hacker News
by HippoBaro 680 days ago
I think the author knows very well what UB is and means. But he’s thinking critically about the whole system.

UB is meant to add value. It’s possible to write a language without it, so why do we have any UB at all? We do because of portability and because it gives flexibility to compiler writers.

The post is all about whether this flexibility is worth it when compared with the difficulty of writing programs without UB.

The author makes the case that (1) there seems to be more money lost on bugs than money saved on faster bytecode and (2) there’s an unwillingness to do something about it because compiler writers have a lot of weight when it comes to what goes into language standards.

Even stipulating that part of the argument, the author then goes on a tear about optimizations breaking constant-time evaluation, which doesn’t have anything to do with UB.

The real argument seems to be that C compilers had it right when they really did embody C as portable assembly, and everything that’s made that mapping less predictable has been a regression.

But C never had been portable assembly.

Which I think is somewhat the core of the problem: people treating C as things it just is not. Whether that is the "C is portable assembly" view or the "it's just bits in memory" view of things (which is often doubly wrong, since it ignores things like hardware caching). Or writing constant-time code on the assumption that the compiler probably, hopefully can't figure out that it can optimize something.
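To make the constant-time point concrete, here is a minimal sketch of the kind of code in question. The name `ct_equal` is hypothetical and not taken from the article or the thread. The source has no early exit, but nothing in the C standard obliges a compiler to keep it that way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: compares two buffers
 * without an early exit, so the source-level loop always runs len
 * iterations. The C standard says nothing about timing, so a compiler
 * is still allowed to turn this into code with data-dependent branches. */
static int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);  /* accumulate any differing bits */
    }
    return diff == 0;                    /* 1 if equal, 0 otherwise */
}
```

The function is correct as far as its return value goes; whether it stays constant-time depends on the compiler, the flags and the target, which is exactly the "probably, hopefully" part.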

> The real argument seems to be that C compilers had it right when they really did embody C as portable assembly

But why would you use such a C? Such a C would be slow compared to its competition while still being prone to problematic bugs. At the same time, people often seem to forget that part of UB is rooted in different hardware doing different things, including cases where the behavior isn't just a register or memory address holding an "arbitrary value" but is more similar to C UB (e.g. when it involves CPU caches).

> But C never had been portable assembly.

The ANSI C standards committee disagrees with you.

"Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler:”

https://www.open-std.org/JTC1/SC22/WG14/www/docs/n897.pdf

p 2, line 39. (p10 of the PDF)

"C code can be portable. "

line 30

The full quote is:

> Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler:” the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§4).

This doesn't say that C is a high-level assembly.

It just says that the committee didn't (at that point in time) want to force the use of "portable" C as a means of preventing the use of C as a high-level assembler. But just because some people use something as a high-level assembler doesn't mean it is high-level assembly (I did use a spoon as a fork once; it's still a spoon).

Furthermore, the fact that they describe forcing portable C with the words "to preclude" and not "to break compatibility" or similar says a lot, I think, about whether or not the committee thought of C as high-level assembly.

Most importantly, the quote is about the process of making the first C standard, which had to ease the transition from various non-standardized C dialects to "standard C". I'm pretty sure that throughout history there have been C dialects and compiler implementations that approached C as high-level assembly, but C as in "standard C" is not that.

It specifically says that the use of C as a "portable assembler" is a use that the standards committee does not want to preclude.

Not sure how much clearer this can be.

That statement means the committee does not want to stop it from being developed. The question is, has it? They mean a specific implementation could work as a portable assembler, mirroring djb's request for an 'unsurprising' C compiler. Another interpretation would be in the context of CompCert, which has been developed to achieve semantic preservation between assembly and its source. Interestingly, this of course hints at verifying an assembled snippet coming from some other source as well. That alternate source for the critical functions then frees the rest of the compiler internals from the problems of preserving constant-timeness and leak freedom through their passes.

These are aspirational statements, not a factual judgment of what that standard or its existing implementations actually are. At least they do not cover all implementations nor define precisely what they cover. Note the immediate next statement: "C code can be non-portable."

In my opinion, C has tried to serve two masters and they made a screw-hammer in the process.

The rest of the field has moved on significantly. We want portable behavior, not implementation-defined vomit that will leave you doubting whether porting introduces new UB paths that you haven't already fully checked against (e.g. by varying the size of integers in such a way that some promotion changes and leads to signed overflow, or that bounds checking becomes ineffective).
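A small sketch of the promotion hazard mentioned above, with hypothetical function names. Where `int` is 32 bits, both `uint16_t` operands are promoted to signed `int` before the multiplication, so large inputs overflow `int`. Where `int` is 16 bits, the very same expression is unsigned arithmetic and simply wraps.

```c
#include <stdint.h>

/* Hypothetical example: on a target with 32-bit int, a and b are promoted
 * to signed int, so 65535 * 65535 overflows int, which is undefined
 * behavior. On a target with 16-bit int they are promoted to unsigned int
 * and the product wraps instead. */
static uint32_t mul_u16_risky(uint16_t a, uint16_t b)
{
    return (uint32_t)(a * b);   /* the multiplication happens before the cast */
}

/* Converting the operands first keeps the product well defined: it is
 * either computed in an unsigned type or fits in a wider int. */
static uint32_t mul_u16_safe(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}
```

The two functions agree for small inputs, which is why this kind of bug tends to survive testing on one platform and show up after a port.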

The paragraph further down about explicitly and swiftly rejecting a validation test suite should also read as a warning. Not only would the proposal of modern software development without a test suite get you swiftly fired today, but they're explicitly acknowledging the insurmountable difficulties in producing any code with consistent cross-implementation behavior. But in the time since then, other languages have demonstrated you can reap many of the advantages of close-to-the-metal without compromising on consistent cross-target behavior, at least for many relevant real-world cases.

They really knew what they were building, a compromise. But that gets cherry-picked into absurdity such as stating C is portable in present-tense or that any inherent properties make it assembly-like. It's neither.

These are statements of intent. And the intent is both stated explicitly and also very clear in the standard document that the use as a "portable assembler" is one of the use cases that is intended and that the language should not prohibit.

That does not mean that C is a portable assembly language to the exclusion of everything and anything else, but it also means the claim that it is definitely in no way a portable assembly language at all is also clearly false. Being a portable assembly (and "high level" for the time) is one of the intended use-cases.

> In my opinion, C has tried to serve two masters and they made a screw-hammer in the process.

Yes. The original intent for which it was designed and in which role it works well.

> The rest of the field has moved on significantly. We want portable behavior, not implementation-defined vomit that will leave you doubting whether porting introduces new UB paths that you haven't already fully checked against

Yes, that's the "other" direction that deviates from the original intent. In this role, it does not work well, because, as you rightly point out, all that UB/IB becomes a bug, not a feature.

For that role: pick another language. Because trying to retrofit C to not be the language it is just doesn't work. People have tried. And failed.

Of course what we have now is the worst of both worlds: instead of either (a) UB serving its original purpose of letting C be a fairly thin and mostly portable shell above the machine, or (b) eliminating UB in order to have stable semantics, compiler writers have chosen (c): exploiting UB for optimization.

Now these optimizations alter program behavior, sometimes drastically and even impacting safety (for example by eliminating bounds checks that the programmer explicitly put in!), despite the fact that the one cardinal rule of program optimization is that it must not alter program behavior (except for execution speed).
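A commonly cited shape of the bounds-check problem, sketched with hypothetical names and assuming `buf` and `end` point into the same array with `buf <= end`. The wraparound test in the first function can only be true if `buf + len` has already left the array, which is UB, so a compiler may treat the test as always false and drop it.

```c
#include <stddef.h>

/* Risky: forming buf + len is already undefined when len is too large,
 * so the compiler may assume the wraparound test can never succeed and
 * remove it. */
static int in_bounds_risky(const char *buf, const char *end, size_t len)
{
    if (buf + len < buf) {
        return 0;               /* intended overflow guard, may be deleted */
    }
    return buf + len <= end;
}

/* Compares lengths instead of forming a possibly out-of-range pointer,
 * so there is no UB for the optimizer to exploit. */
static int in_bounds_safe(const char *buf, const char *end, size_t len)
{
    return len <= (size_t)(end - buf);
}
```

Whether a given compiler actually removes the guard depends on version and flags; the point is only that the standard permits it.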

The completely schizophrenic "reasoning" for this altering of program behavior being somehow OK is that, at the same time that we are using UB to optimize all over the place, we are also free to assume that UB cannot and never does happen. This despite the fact that it is demonstrably untrue. After all UB is all over the C standard, and all over real world code. And used for optimization purposes, while not existing.

> They really knew what they were building, a compromise.

Exactly. And for the last 3 decades or so people have been trying unsuccessfully to unpick that compromise. And the result is awful.

The interests driving this are also pretty clear. On the one hand a few mega-corps for whom the tradeoff of making code inscrutable and unmanageable for The Rest of Us™ is completely worth it as long as it shaves off 0.02% running time in the code they run on tens or hundreds of data centers and I don't know how many machines. On the other hand, compiler researchers and/or open-source compiler engineers who are mostly financed by those few megacorps (the joy of open-source!) and for whom there is little else in terms of PhD-worthy or paid work to do outside of that constellation.

I used to pay for my C compiler, thus there was a vendor and I was their customer and they had a strong interest in not pissing me off, because they depended on me and my ilk for their livelihood. This even pre-dated the first ANSI-C standard, so all the compiler's behavior was UB. They still didn't pull any of the shenanigans that current C compilers do.

Back in 1989, when C abstract machine semantics were closer to being a portable macro processor, and stuff like the register keyword was actually something compilers cared about.

And even then there was no notion of constant-time being observable behavior to the compiler. You cannot write reliably constant-time code in C because execution time is not a property the C language includes in its model of computation.

But having a straightforward/predictable mapping to the underlying machine and its semantics is included in the C model of computation.

And that is actually not just compatible with the C "model of computation" being otherwise quite incomplete; these two properties are really just two sides of the same coin.

The whole idea of an "abstract C machine" that unambiguously and completely specifies behavior is a fiction.

Nobody says that implementation-defined behavior must be sane or safe. The crux of the issue is that a compiler can assume that UB never happens, while IB is allowed to. Does anyone have an example where the assumption that UB never happens actually makes the program faster and better, compared to UB==IB?

The issue is that you’d have to come up with and agree on an alternative language specification without (or with less) UB. Having the compiler implementation be the specification is not a solution. And such a newly agreed specification would invariably either turn some previously conforming programs nonconforming, or reduce performance in relevant scenarios, or both.

That’s not to say that it wouldn’t be worth it, but given the multitude of compiler implementations and vendors, and the huge amount of existing code, it’s a difficult proposition.

What has traditionally been done is either to define some “safe” subset of C verified by linters or, since you probably want to break some compatibility anyway, to design a separate new language.

> UB is meant to add value. It’s possible to write a language without it, so why do we have any UB at all? We do because of portability and because it gives flexibility to compilers writers.

Implementation-defined behavior is there for portability of valid code. Undefined behavior is there so that compilers have leeway in handling invalid conditions (like null pointer dereference, out-of-bounds access, integer overflow, division by zero, ...).

What does it mean for a language not to have UB? There are several ways to handle invalid conditions:

1) eliminate them at compile time - this is optimal, but currently practical just for some classes of errors.

2) have consistent, well-defined behavior for them - platforms may have vastly different ways of handling invalid conditions

3) have consistent, implementation-defined behavior for them - usable for some classes of errors (integer overflow, division by zero), but for others it would add extensive runtime overhead.

4) have inconsistent behavior (UB) - C way
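
As a rough illustration of the difference between options 2) and 4), here is a sketch of integer division with a defined outcome for the two inputs that are UB in plain C. The name `checked_div` is hypothetical, and the extra comparisons are the runtime cost that option 2) implies.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: reports failure instead of dividing for the two
 * cases where a / b is undefined in C, namely b == 0 and INT_MIN / -1. */
static bool checked_div(int a, int b, int *out)
{
    if (b == 0) {
        return false;           /* division by zero */
    }
    if (a == INT_MIN && b == -1) {
        return false;           /* quotient is not representable in int */
    }
    *out = a / b;               /* well defined for all remaining inputs */
    return true;
}
```

A caller would test the return value before using `*out`; plain `a / b` skips both comparisons and leaves those two inputs undefined.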

> It’s possible to write a language without it

Whenever you do that, programmers deride the language for being "excessively academic" or something