by petergeoghegan 1362 days ago
> While I do sympathize with some of the user complaints with UB, and the issues with things like signed integer overflow and strict aliasing seem entirely gratuitous, I think most users complaining about UB fail to comprehend that the issue with UB is that it's often really hard to constrain just what can possibly go wrong--and that's even without compiler optimizations kicking into play.

That's probably true, but compiler people do themselves no favors by pretending that these things come from some higher echelon, that they couldn't possibly presume to question. It just doesn't pass the smell test.

The fact that -fwrapv and -fno-strict-aliasing are not the defaults in GCC is a choice made by GCC. A bad choice, in my opinion. MSVC made different choices, and lots of people still use it, so there is an existence proof that you can just not do these things on a mainstream compiler. (In fact, MSVC doesn't even offer type-based aliasing as an option that can be enabled, last I checked.)
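For anyone who hasn't run into these two flags, here is a minimal sketch of the code patterns they affect. The function names are made up for illustration, and the float example assumes a 32-bit IEEE 754 float. Both functions are written so that they are well defined under the default GCC settings, which is exactly the extra care the defaults demand.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if a + b would overflow int, without ever performing the
   overflowing addition. Computing a + b first and checking afterwards is
   undefined behaviour unless the code is built with -fwrapv. */
int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}

/* Reads the bit pattern of a float (assumes a 32-bit float).
   memcpy is the portable route. The shorter *(uint32_t *)&f cast breaks
   the type-based aliasing rules, and is the kind of code that
   -fno-strict-aliasing is there to keep working. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With -fwrapv the naive "add, then check the sign" test becomes defined, and with -fno-strict-aliasing the pointer cast behaves the way most people expect it to.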


How much performance gets left on the table as you disable ever more optimisations though?

The justification for monstrously unsafe languages like C was that they're faster. If, after removing the optimisations which are too tricky to write code for, they're no longer faster, then the languages don't pay their way any more and there's no reason to use them.

I was expecting it would be easy to find benchmarks trying the same C or C++ code with GCC, Clang and MSVC and giving performance numbers, but I didn't find that. Maybe it exists and I can be directed to it ?

> The justification for monstrously unsafe languages like C was that they're faster.

I don't think that that's true. I find the explanation given by "Some Were Meant for C" [1] far more plausible.

But leaving that aside: what does that have to do with anything that I said? And might I be permitted to make a point about GCC that is wholly unrelated to Rust, without getting a generic lecture about memory safety?

> I was expecting it would be easy to find benchmarks trying the same C or C++ code with GCC, Clang and MSVC and giving performance numbers, but I didn't find that. Maybe it exists and I can be directed to it ?

I don't doubt that there are silly compiler microbenchmarks somewhere. And I know for sure that strict aliasing could in principle make a huge difference. For example, an autovectorization optimization could take place once the compiler had leeway to apply an assumption about two pointers not aliasing, but not otherwise.
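A minimal sketch of the kind of loop I have in mind (the function name is made up, and whether a given compiler actually vectorizes it depends on version and flags):

```c
/* dst is a float pointer and len is an int pointer. Under type-based
   aliasing rules the compiler may assume that a store through dst never
   changes *len, so it can load *len once and is free to vectorize the
   loop. With -fno-strict-aliasing it has to allow for the two
   overlapping, and may end up reloading *len on every iteration. */
void scale_all(float *dst, const int *len, float factor)
{
    for (int i = 0; i < *len; i++)
        dst[i] *= factor;
}
```

The result is the same either way for sane callers; only the set of transformations the optimizer is allowed to consider changes.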

However, in practice it doesn't seem to make all that much difference for most kinds of C programs, for all kinds of reasons that are very difficult to pin down. The big exceptions generally involve numerical code, which is why Fortran has always tended to be faster than C for numerical applications. At least it definitely was for most of the history of both languages. (Yes, C was openly understood to be slower than Fortran in cases that were important for Fortran 40+ years ago. I refer you to [1] once more.)
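For the numerical case, C99's restrict qualifier is the usual way to hand the compiler a Fortran-style no-overlap promise by hand. A rough sketch, with an illustrative function name:

```c
/* axpy-style kernel: y[i] += a * x[i].
   restrict is the programmer's promise that x and y do not overlap,
   which is roughly what Fortran assumes for array arguments by default.
   Calling this with overlapping arrays is undefined behaviour. */
void axpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Note that type-based aliasing cannot help here at all, since both pointers have the same element type.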

[1] https://www.cs.kent.ac.uk/people/staff/srk21//research/paper...

> I don't think that that's true. I find the explanation given by "Some Were Meant for C" [1] far more plausible.

Kell's argument rests on the idea that C is doing something you can't do in safe languages. Mostly it says the safe languages are managed, and so they simply can't match C for what Kell describes as "communicativity", which, as we will see shortly, isn't meaningfully true.

Now, one of the first examples Kell gives (from Duff's device) is something Dennis Ritchie thought was a huge mistake: the volatile qualifier. In C we can do MMIO the same way as in machine code: we refer to a volatile "object" that isn't really in memory, the CPU's fetches and stores are issued, and the MMIO happens. In a language like Rust this doesn't work; it has intrinsics which emit the same machine code, but there is no pretend object.
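To make the "pretend object" concrete, here is a minimal sketch of the C idiom. The register layout, the ready bit and the names are all hypothetical, not taken from any real device.

```c
#include <stdint.h>

/* Hypothetical device register block, for illustration only. */
typedef struct {
    volatile uint32_t status;
    volatile uint32_t data;
} uart_regs;

#define UART_READY 0x1u  /* hypothetical "ready" bit in status */

/* Every access to a volatile member is a real load or store; the
   compiler may not cache, merge, or drop it. Returns 1 if the byte
   was written, 0 if the device was busy. */
int uart_try_send(uart_regs *dev, uint8_t byte)
{
    if ((dev->status & UART_READY) == 0)
        return 0;          /* device busy, nothing written */
    dev->data = byte;      /* the store itself is the I/O */
    return 1;
}
```

On real hardware dev would be a fixed address cast to uart_regs *. In a test an ordinary struct can stand in for it, which is rather the point: nothing in the code says whether the "object" is memory or a device.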

Kell, like many C programmers, thinks C is revealing an important truth here, but it's actually keeping up a damaging masquerade. That object is a lie. To the extent we write code which pretends it's real, that code is misleading, and sooner or later a maintenance programmer will believe the lie and cause problems. MMIO is similar to memory access not because of some "truth" but just because it was technically convenient.

[[ The other notable thing the volatile qualifier is (ab)used for is once again in MSVC. Microsoft semantics for volatile are atomic Acquire-release so you can use it for IPC and even within a concurrent program. That's not what it was for on Unix, it's not what the standard says it does, but it's how it happened to work on a single x86 CPU so that's what MSVC provides even today, and even on ARM if you accept the resulting performance penalty. ]]
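For comparison, the standard way to ask for acquire-release ordering in C11 is <stdatomic.h>, not volatile. A minimal sketch with made-up names, showing the publish/consume pattern that the Microsoft volatile semantics get used for:

```c
#include <stdatomic.h>

static int payload;        /* plain data handed from writer to reader */
static atomic_int ready;   /* flag that carries the ordering; starts at 0 */

/* Writer side: the release store makes the payload write visible to
   any reader that later observes ready == 1 with an acquire load. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader side: returns 1 and fills *out if a value has been
   published, otherwise returns 0 and leaves *out alone. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    *out = payload;
    return 1;
}
```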

But even outside of volatile, which again is definitely a bad idea, Kell says C is better than the safe languages because of communicativity. His next examples are just data in memory, but they are "foreign" to the software. C will be able to access this data as raw bytes.

If your experience of managed languages is, as Kell's seems to have been, a Lisp, then maybe the ability to access bytes of memory is remarkable. But you don't need Rust to do this from a safe language. C# isn't just safe, it's a managed, garbage-collected language, yet it can access a byte slice just as well as C. Given a slice of bytes, step through it, one instruction at a time (with the size of an instruction to be established by a separate function) and ask a callback to look at those instructions. No problem in C#.
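For reference, this is roughly what the task looks like on the C side. It is only a sketch: the instruction encoding is a toy one invented for the example, and all the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy encoding: the low two bits of the first byte give
   the instruction length minus one, so instructions are 1 to 4 bytes. */
static size_t insn_size(const uint8_t *p)
{
    return (size_t)(p[0] & 0x3u) + 1;
}

typedef void (*insn_callback)(const uint8_t *insn, size_t len, void *ctx);

/* Walks buf one instruction at a time and hands each one to cb.
   Returns the number of instructions visited; stops early if an
   instruction would run past the end of the buffer. */
size_t walk_insns(const uint8_t *buf, size_t n, insn_callback cb, void *ctx)
{
    size_t off = 0, count = 0;
    while (off < n) {
        size_t len = insn_size(buf + off);
        if (len > n - off)
            break;                  /* truncated instruction */
        cb(buf + off, len, ctx);
        off += len;
        count++;
    }
    return count;
}

/* Example callback: adds up the lengths it is shown. */
void sum_lengths(const uint8_t *insn, size_t len, void *ctx)
{
    (void)insn;
    *(size_t *)ctx += len;
}
```

Nothing here needs more than a pointer, a length and a function pointer, and a bounds-checked byte slice plus a delegate covers the same ground.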

The false belief that everything is just C anyway causes significant grief. C programmers tend to see text and think this means C's weird NUL-termination rule applies. So given the string "microsoft.com\0bo-chicken.example.com" and asked whether it's exactly "microsoft.com", the C code says yes, opening a security vulnerability. This really happened, and yet of course C programmers insisted it's somebody else's fault. And that's key here: Kell's examples are "foreign" in the sense that they aren't from this C program. But these examples aren't structures from Java, or Scheme, or even Rust, they're just from a different C program. This "communicativity" is unimaginative.

It reminds me of colonial Europeans passing judgement on the "savage" inhabitants of a place they've now decided belongs to them. Why don't these useless Indians have properly 0-byte terminated strings? And why are their symbol names so complicated? No no, this is all wrong, the correct way is the way I grew up with, and any alternative is not better, nor even just different, but necessarily wrong.

The "safety" Kell envisions for a hypothetical safer C implementation relies on saying that behaviours which have been passionately defended here by C programmers including yourself as needing to be defined are Undefined and so can safely be outlawed. Aliasing between types? Kell says you shouldn't expect that to work and so it shan't in his safe C.

Finally, what Kell comes back to is an assumption which is small in C but grew enormous in C++, about what we should do if the programs might be nonsense. Rice's Theorem says we can't necessarily decide if semantic properties hold for arbitrary programs. To defuse this problem we sometimes give up instead. Thus, a program to decide if another program is correct (such as a compiler) will have three possible results: Correct, Wrong and Not Sure. It's obvious what to do with the first two categories, but that leaves us with Not Sure. In both C and C++ the answer is that those programs go in the Correct bucket anyway and we cross our fingers.

Rust says no, all the "Not sure" programs go in the "Wrong" bucket and the programmer can just modify the program until it's Correct so we're fine. This is a pragmatic engineering response. Our dynamic analysis says this new bridge might literally explode, I think the analysis method may be wrong about that, but I can't prove it. Should we build the bridge anyway and see if it explodes? No! Design a bridge that passes analysis!
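As a toy model (not how any real compiler is structured, and the names are invented), the whole difference between the two approaches fits in one branch:

```c
typedef enum { VERDICT_CORRECT, VERDICT_WRONG, VERDICT_NOT_SURE } verdict;
typedef enum { POLICY_ACCEPT_UNSURE, POLICY_REJECT_UNSURE } policy;

/* Returns 1 if the program is accepted, 0 if it is rejected.
   The two policies differ only in where NOT_SURE lands. */
int accepts(policy p, verdict v)
{
    switch (v) {
    case VERDICT_CORRECT:  return 1;
    case VERDICT_WRONG:    return 0;
    case VERDICT_NOT_SURE: return p == POLICY_ACCEPT_UNSURE;
    }
    return 0;  /* not reached for valid verdict values */
}
```

POLICY_ACCEPT_UNSURE stands for the C and C++ answer, POLICY_REJECT_UNSURE for the Rust one.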

The Rust and C++ approaches have very different incentives for language designers. In Rust the incentive is to shrink that "Not sure" pile, because it annoys programmers. Non-lexical lifetimes are an example of that shrinking process. A Rust program that only compiles thanks to NLL was always actually fine, but before the NLL change landed in the compiler it would be rejected because it was in the "Not sure" category. In C++ the incentive is to grow the "Not sure" category, because every program we can convince ourselves might be correct is another C++ program in the Correct bucket. Some, perhaps many, perhaps even most are nonsense, but at least some of them are correct and we don't care to distinguish. The same tendency exists, to a lesser extent, in C itself.

> It reminds me of colonial Europeans passing judgement on the "savage" inhabitants of a place they've now decided belongs to them

Okay!

Now you made me doubt myself, because I'm pretty sure that analogy was the weakest part of what I wrote, it was just how it seemed to me when I was writing.

On further reflection I think maybe reputation for performance matters rather than performance itself. But I have strayed far off topic.

> On further reflection I think maybe reputation for performance matters rather than performance itself

I think that you're vastly overestimating the importance of C as an abstract specification and as a community of programmers with a coherent set of shared goals. You're also too focussed on performance. There is a practical sense in which C will tend to perform better for certain tasks, but it doesn't necessarily have all that much to do with the language itself. It's the whole ecosystem. And yes, path dependence matters. It isn't intrinsically true that it has to be this way, a little like how it isn't intrinsically true that we have to use QWERTY keyboards instead of Dvorak keyboards.

It's not that there aren't lots of serious problems with C -- there certainly are. It's that those problems are systemic problems; they're more the result of a huge number of people making a huge number of pragmatic decisions, day after day, year after year -- and the sequence matters. Many of these people are not computer programmers. Many are from hardware vendors that have people that sit on standards bodies for everything from NVMe to RISC-V. These are all people that more or less all look at the world as it actually is today, and build on that incrementally. They build accretions on top of accretions.

There are many glaring contradictions in C. Depending on who you ask, it's either a portable assembler, or a programming language that targets something called the C abstract machine. And neither party seems to want to even address the glaring inconsistency! This is a cabal that seems to have a real problem with staying on message, don't you think?

I make only very modest claims here. I'm not saying that this is good or inevitable; only that it is the best explanation I am aware of. I'm definitely not saying that we can't do better. Only that I believe that the current state of affairs works as well as it does (i.e. barely adequately) because in the end it's very difficult to get an enormous number of people separated by time and space to agree on anything at all. C more or less remains the de facto standard when operating at the hardware/software interface not in spite of its glaring contradictions. It's because of them.

It's all but impossible for me to prove any of this, because I'm describing diffuse, emergent behavior -- what I'm arguing is that things tend to take the path of least resistance, in an environment where companies come and go, and short term business considerations hold sway. I might be willing to put more effort into convincing you of this if I really was the C zealot that you imagine me to be, but I'm not.