I don’t work a ton with C, but I wonder how C programmers keep track of what behavior is and is not defined. It seems like there are many possible edge cases.
We get by on a combination of matching patterns (any pointer cast gets a lot of scrutiny, for example), compiler warnings, tools like UBSan, debugging when things go wrong, and sheer dumb luck.
Having an understanding of how the code gets transformed into machine code helps. For this case, there's the basic idea that `a++` will boil down to three basic conceptual operations: fetch, add, and store, and those can be potentially interleaved with other parts of the statement. In something like `a++ + ++b` the interleaving doesn't affect the outcome no matter how it's done. In `a++ + ++b` the interleaving can affect the outcome, and that's your sign that something might be wrong.
Any memory safety issue in C code had to involve UB at some point. And you can see how prevalent those are, and deduce how not-particularly-great we are at keeping track of UB.
> Having an understanding of how the code gets transformed into machine code helps
I'm not sure about that. Knowing assembly is not a substitute for knowing how the language is defined. Sometimes C/C++ programmers with some assembly knowledge reason themselves into thinking that what they're asking of the language must have well-defined behaviour, when in fact it's undefined behaviour. It doesn't really matter whether interleaving order can change the output. (++i)++ is, apparently [0], undefined behaviour in C but has well defined behaviour in C++.
I don't mean assembly in this case, but something more like the compiler's view of the code. a++ can be broken down into more primitive operations, and might actually be, depending on how the compiler is implemented. The fact that the ordering of those more primitive operations with respect to other operations isn't very tightly constrained is something you'd just have to know about the language, I suppose.
> The fact that the ordering of those more primitive operations with respect to other operations isn't very tightly constrained is something you'd just have to know about the language, I suppose.
No, that's not right. It's undefined behaviour, not merely an unspecified order of evaluation. Roughly speaking, the behaviour of the entire program is unconstrained by the language standard after execution of that statement. It could crash the whole process, for instance, or go haywire.
It's worse than that, the behavior of the entire program is unconstrained by the language standard beforehand too. Raymond Chen discusses how things can go wrong once you're going to reach UB even before you get to it: https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...
Anyway, I didn't mean to imply that things behaved as written aside from ordering issues. I only meant that this sort of principle can help you remember where UB lurks. Generally, where a kind C compiler might just mess with your numbers a bit, an evil C compiler can legally make demons fly out of your nose.
> It's worse than that, the behavior of the entire program is unconstrained by the language standard beforehand too. Raymond Chen discusses how things can go wrong once you're going to reach UB even before you get to it
Heh, yes that's exactly what I was thinking when I put roughly speaking.
> where a kind C compiler might just mess with your numbers a bit, an evil C compiler can legally make demons fly out of your nose
Yes, signed integer overflow being another. Presumably it's defined that way as it's simpler than trying to spell out all the behaviours the compiler is permitted to implement, and on top of that there are trap representations to worry about. I doubt modern compilers get much optimization benefit from it though. There's a StackOverflow thread discussing the reasons it's defined this way: https://stackoverflow.com/q/1860461
They don't. In the culture some kinds of undefined behaviour are taken seriously and some aren't. If you want to write code that "works", you emulate what popular performance benchmarks do (whether their code is undefined according to the standard or not), since those are the thing that C compiler developers actually care about.
Personally, when ever I write a modifying statement, I wonder about the domains of the input and ensure, that the condition necessary to stay in the existing range is evaluated. If it is not, I either write the condition, reduce the input domain, or increase the output domain.
They don't really. In fact there are many things that are technically UB but are so common that compilers can't really treat them as UB. E.g. type punning via unions.
Type punning via unions is not UB in C in general, but it is in C++ IIRC.
I write "in general" because, as with other forms of memory reinterpretation (memcpy or copy through a character type), evaluating a trap representation triggers UB.
And a slightly longer version is, that there are three types involved: the type of access, the effective type of the object[0], and the type of the variable. The type of the variable is only for the compiler to emit warnings, as long as the effective type and the type of access are equal, it isn't UB.
Having an understanding of how the code gets transformed into machine code helps. For this case, there's the basic idea that `a++` will boil down to three basic conceptual operations: fetch, add, and store, and those can be potentially interleaved with other parts of the statement. In something like `a++ + ++b` the interleaving doesn't affect the outcome no matter how it's done. In `a++ + ++b` the interleaving can affect the outcome, and that's your sign that something might be wrong.
Any memory safety issue in C code had to involve UB at some point. And you can see how prevalent those are, and deduce how not-particularly-great we are at keeping track of UB.