Hacker News
by wat10000 42 days ago
We get by on a combination of matching patterns (any pointer cast gets a lot of scrutiny, for example), compiler warnings, tools like UBSan, debugging when things go wrong, and sheer dumb luck.

Having an understanding of how the code gets transformed into machine code helps. For this case, there's the basic idea that `a++` will boil down to three conceptual operations: fetch, add, and store, and those can potentially be interleaved with other parts of the statement. In something like `a++ + ++b` the interleaving doesn't affect the outcome no matter how it's done. In `a++ + ++a` the interleaving can affect the outcome, and that's your sign that something might be wrong.
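To make the fetch/add/store idea concrete, here's a rough sketch in plain, well-defined C that models the two expressions by hand. The helper names (`pre_inc`, `post_inc`, `model_distinct`, `model_same`, `check_model`) and the three schedules are made up for illustration. A real compiler isn't limited to these schedules, and the real `a++ + ++a` is undefined behavior, not "one of these results"; the model only shows why that expression should set off alarms.

```c
#include <assert.h>

/* ++x modeled as fetch, add, store; yields the new value. */
static int pre_inc(int *x)
{
    int tmp = *x;   /* fetch */
    tmp = tmp + 1;  /* add   */
    *x = tmp;       /* store */
    return tmp;
}

/* x++ modeled the same way; yields the old value. */
static int post_inc(int *x)
{
    int old = *x;   /* fetch */
    *x = old + 1;   /* add, then store */
    return old;
}

/* Models `a++ + ++b` with either operand evaluated first. */
int model_distinct(int a, int b, int right_first)
{
    int left, right;
    if (right_first) {
        right = pre_inc(&b);
        left = post_inc(&a);
    } else {
        left = post_inc(&a);
        right = pre_inc(&b);
    }
    return left + right;
}

/* Models `a++ + ++a` under three hypothetical schedules.
   Schedules 0 and 1 run the operands back to back; schedule 2
   splits a++ around ++a (fetch, then all of ++a, then store). */
int model_same(int a, int schedule)
{
    int left, right;
    switch (schedule) {
    case 0:
        left = post_inc(&a);
        right = pre_inc(&a);
        break;
    case 1:
        right = pre_inc(&a);
        left = post_inc(&a);
        break;
    default:
        left = a;             /* fetch for a++ */
        right = pre_inc(&a);  /* all of ++a */
        a = left + 1;         /* delayed store for a++ */
        break;
    }
    return left + right;
}

/* Distinct variables: order is irrelevant. Same variable: it isn't. */
void check_model(void)
{
    assert(model_distinct(1, 1, 0) == model_distinct(1, 1, 1));
    assert(model_same(1, 0) != model_same(1, 2));
}
```

Starting from `a = 1, b = 1`, both orderings of the first model agree, while the split schedule of the second model disagrees with the back-to-back ones.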

Any memory safety issue in C code had to involve UB at some point. And you can see how prevalent those are, and deduce how not-particularly-great we are at keeping track of UB.


> Having an understanding of how the code gets transformed into machine code helps

I'm not sure about that. Knowing assembly is not a substitute for knowing how the language is defined. Sometimes C/C++ programmers with some assembly knowledge reason themselves into thinking that what they're asking of the language must have well-defined behaviour, when in fact it's undefined behaviour. It doesn't really matter whether interleaving order can change the output. `(++i)++` is, apparently [0], undefined behaviour in C but has well-defined behaviour in C++.

[0] https://stackoverflow.com/a/58841107

I don't mean assembly in this case, but something more like the compiler's view of the code. a++ can be broken down into more primitive operations, and might actually be, depending on how the compiler is implemented. The fact that the ordering of those more primitive operations with respect to other operations isn't very tightly constrained is something you'd just have to know about the language, I suppose.

> The fact that the ordering of those more primitive operations with respect to other operations isn't very tightly constrained is something you'd just have to know about the language, I suppose.

No, that's not right. It's undefined behaviour, not merely an unspecified order of evaluation. Roughly speaking, the behaviour of the entire program is unconstrained by the language standard after execution of that statement. It could crash the whole process, for instance, or go haywire.

(Again, that's in C, apparently, but not in C++.)
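For contrast, here's a minimal sketch of what a merely unspecified order of evaluation looks like. The names (`left_operand`, `right_operand`, `sum_operands`, `check_sum`, `call_log`) are hypothetical. Either function may be called first, but as far as I know the calls are indeterminately sequenced rather than unsequenced, so they can't overlap and the program stays well defined whichever order the compiler picks.

```c
#include <assert.h>

int call_log[2];  /* records which operand ran first and which second */
int call_count;

int left_operand(void)
{
    call_log[call_count++] = 'L';
    return 1;
}

int right_operand(void)
{
    call_log[call_count++] = 'R';
    return 2;
}

/* The order of the two calls is unspecified, but each call runs to
   completion before the other starts, so there is no UB here. */
int sum_operands(void)
{
    call_count = 0;
    return left_operand() + right_operand();
}

/* The sum is always 3; only the recorded order may vary. */
void check_sum(void)
{
    int sum = sum_operands();
    assert(sum == 3);
    assert(call_count == 2);
    assert(call_log[0] != call_log[1]);
}
```

The log may read L, R or R, L depending on the compiler, and portable code shouldn't rely on either. That's a much weaker problem than the unsequenced modifications discussed above.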

It's worse than that, the behavior of the entire program is unconstrained by the language standard beforehand too. Raymond Chen discusses how things can go wrong once you're going to reach UB even before you get to it: https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...

Anyway, I didn't mean to imply that things behaved as written aside from ordering issues. I only meant that this sort of principle can help you remember where UB lurks. Generally, where a kind C compiler might just mess with your numbers a bit, an evil C compiler can legally make demons fly out of your nose.

> It's worse than that, the behavior of the entire program is unconstrained by the language standard beforehand too. Raymond Chen discusses how things can go wrong once you're going to reach UB even before you get to it

Heh, yes, that's exactly what I was thinking of when I wrote "roughly speaking".

> where a kind C compiler might just mess with your numbers a bit, an evil C compiler can legally make demons fly out of your nose

Yes, signed integer overflow being another. Presumably it's defined that way as it's simpler than trying to spell out all the behaviours the compiler is permitted to implement, and on top of that there are trap representations to worry about. I doubt modern compilers get much optimization benefit from it though. There's a StackOverflow thread discussing the reasons it's defined this way: https://stackoverflow.com/q/1860461

Apparently signed integer overflow UB helps with loop optimizations because it makes it easy to prove the loop always terminates. I assume that's not why it's UB, though; surely it's UB because some systems trapped on overflow, or produced different results due to using ones' complement, and the optimization side of the rule was a happy accident. There's a lot of history in this language and it really shows.
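Not the loop case itself, but the simplest instance I know of the same assumption, as a hedged sketch: because signed overflow is UB, an optimizer is allowed to treat `x + 1 > x` as always true for `int`, whereas for `unsigned` it has to honor wraparound. The function names here are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* For int, x + 1 overflows only when x == INT_MAX, which is UB, so a
   compiler may assume it never happens and fold this to "return 1". */
int next_is_greater_signed(int x)
{
    return x + 1 > x;
}

/* For unsigned, wraparound is defined: UINT_MAX + 1u is 0, so the
   comparison really is false for that one input. */
int next_is_greater_unsigned(unsigned x)
{
    return x + 1u > x;
}

/* Only inputs with defined behavior are used here; calling the signed
   version with INT_MAX would itself be UB. */
void check_next(void)
{
    assert(next_is_greater_signed(5) == 1);
    assert(next_is_greater_unsigned(5u) == 1);
    assert(next_is_greater_unsigned(UINT_MAX) == 0);
}
```

The same "the counter never wraps" reasoning is what lets a compiler pin down the trip count of a loop with a signed index.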