Hacker News new | ask | show | jobs
by ajross 648 days ago
I think the problem is sort of a permutation of this argument: way way too much attention is being paid to warning about the dangers and inadequacies of the standard's UB corners, and basically none to a good faith effort to clean up the problem.

I mean, it wouldn't be that hard in a technical sense to bless a C dialect that did things like guarantee 8-bit bytes, signed char, NULL with a value of numerical zero, etc... The overwhelming majority of these areas are just spots where hardware historically varied (plus a few things that were simple mistakes), and modern hardware doesn't have that kind of diversity.

Instead, we're writing, running and trying to understand tools like UBSan, which is IMHO a much, much harder problem.

3 comments

The exercise you suggest is futile. You've assumed that all these C programmers are writing software with a clear meaning and we just need to properly translate it so that the meaning is delivered.

There were C programmers like that, most of them now write Rust. They write what they meant, in Rust it just does what they wrote, they're happy.

But a large number - by now a majority of the die-hard C programmers - don't want that. They want to write nonsense and have it magically work. They don't need a new C dialect or a better compiler, or anything like that, they need fairy tale magic.

> There were C programmers like that, most of them now write Rust.

Please don't. There's a space in the world for language flames. But the real world is filled with people trying to evolve existing codebases using tools like ubsan, and that's what I'm talking about.

I don't see how this constitutes a language flame. I spent decades writing C. Twenty years ago I'd have agreed that it was worth trying to "fix" C, today I just write Rust instead.
I'm a huge proponent of UBSan and ASan. Genuine curiosity, what don't you like about them?

FWIW, there once was a real good-faith effort to clean up the problems, Friendly C by Prof Regehr, https://blog.regehr.org/archives/1180 and https://blog.regehr.org/archives/1287 .

It turns out it's really hard. Let's take an easy-to-understand example, signed integer overflow. C has unsigned types with guaranteed 2's complement rules, and signed types with UB on overflow, which leaves the compiler free to rewrite the expression using the field axioms, if it wants to. "a = b * c / c;" may emit the multiply and divide, or it can eliminate the pair and replace the expression with "a = b;".

Why do we connect interpreting the top bit as a sign bit with whether field axiom based rewriting should be allowed? It would make sense to have a language which splits those two choices apart, but if you do that, either the result isn't backwards compatible with C anyways or it is but doesn't add any safety to old C code even as it permits you to write new safe C code.

Sometimes the best way to rewrite an expression is not what you'd consider "simplified form" from school because of the availability of CPU instructions that don't match simple operations, and also because of register pressure limiting the number of temporaries. There's real world code out there that has UB in simple integer expressions and relies on it being run in the correct environment, either x86-64 CPU or ARM CPU. If you define one specific interpretation for the same expression, you are guaranteed to break somebody's real world "working" code.

I claim without evidence that trying to fix up C's underlying issues is all decisions like this. That leads to UBSan as the next best idea, or at least, something we can do right now. If nothing else it has pedagogical value in teaching what the existing rules are.

I'm not sure I understand your example. Existing modern hardware is AFAICT pretty conventionally compatible with multiply/divide operations[1] and will factor equivalently for any common/performance-sensitive situation. A better area to argue about are shifts, where some architectures are missing compatible sign extension and overflow modes; the language would have to pick one, possibly to the detriment of some arch or another.

But... so what? That's fine. Applications sensitive to performance on that level are already worrying about per-platform tuning and always have been. Much better to start from a baseline that works reliably and then tune than to have to write "working" code you then must fight about and eventually roll back due to a ubsan warning.

[1] It's true that when you get to things like multi-word math that there are edge cases that make some conventions easier to optimize on some architectures (e.g. x86's widening multiply, etc...).

The issue is that would not be backwards-compatible with all existing code. People might have to actually fix their programs to work reliably on the hardware they're using. That's almost always considered unacceptable. Also there are still lots of projects using C89, where `gets()` still exists. It got removed in 2011, but if you compile with -std=c89 or -std=c99 it still works, 36 years after the Morris worm should have taught everyone better!

The C standard developers did guarantee 8-bit bytes for C23, so maybe in 50 years that'll be the default C version.

> The issue is that would not be backwards-compatible with all existing code

Well, by definition it would, since the only behavior affected is undefined. But sure, in practice. Any change to the language is a danger, and ISO C is very conservative.

I'm just saying that the mental bandwidth of dealing with the old UB mess is now much higher than what it would take to try to fix it. All those cycles of "run ubsan on this old codebase and have endless arguments about interpretation and false positives[1]" could be trivially replaced with "port old codebase to C-unundefined standard", which would put us in a much better position.

[1] Trust me, I've been in these bikesheds and it's not pretty.