by userbinator 2542 days ago
Linus Torvalds' opinion of standards, linked in that article, is very much worth reading, so I'll link it again here: https://lkml.org/lkml/2018/6/5/769

"So standards are not some kind of holy book that has to be revered. Standards too need to be questioned."

The way I see it, a lot of compiler writers are basically taking the standard as gospel and ignoring everything else "because the standard doesn't say we can't" --- and that's a huge problem, because behaviour that the standard doesn't define often has a far more common-sense meaning that programmers expect. IMHO the onus should really be on the authors of compilers to find that reasonable meaning. In fact, the standard even suggests that one possible undefined behaviour is something like "behave in a manner characteristic of the environment" (can't remember nor be bothered looking up the standard.)

This is a common misconception. Compiler authors don't exploit undefined behavior to make themselves seem smart, or because they like breaking code. They exploit undefined behavior because somebody filed a bug saying some code was slow, and exploiting UB was the simplest way--or, in many cases, the only way--to fix the performance problem.

GCC and Clang do give you the option to avoid optimizations based on undefined behavior: compile at -O0. We think of the low-level nature of C as being good for optimization, but in many cases the C language as people expect it to work is at odds with fast code.

It's fascinating to actually dive into the specific instances of undefined behavior exploitation that get the most complaints. In each such case, there is virtually always a good reason for it. For example, treating signed overflow of integers as UB is important to avoid polluting perfectly ordinary loops with movsx instructions everywhere on x86-64. It's easy to see why compiler developers added these optimizations: someone filed a bug saying "hey, why is my loop full of movsx", and the developers fixed the problem.

Edit: Should be movsx instead of movzx, sorry.
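
For anyone who wants to look at the loop case themselves, here is a minimal sketch of the kind of code in question. The function names are made up for illustration, and whether a particular compiler actually emits movsx/movsxd here depends on the compiler, its version and the flags, so inspect the generated assembly rather than taking this as a claim about any specific output. The idea is only that with a signed `int` index the compiler is allowed to assume `i` never overflows and keep it widened to 64 bits, while the `size_t` version is pointer-width to begin with.

```c
#include <stddef.h>

/* Hypothetical example: signed 32-bit index on x86-64.
   Overflow of i is undefined, so the compiler may assume it never
   wraps and track i as a 64-bit value instead of re-extending it
   on every iteration. */
long sum_every_other_int(const int *a, int n) {
    long total = 0;
    for (int i = 0; i < n; i += 2) {
        total += a[i];
    }
    return total;
}

/* Same loop with a pointer-width unsigned index: no widening of the
   index is needed when forming the address. */
long sum_every_other_size(const int *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i += 2) {
        total += a[i];
    }
    return total;
}
```

Both functions return the same result for valid inputs; the difference, if any, is only in the code the optimizer is permitted to generate.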

Could you go into a little bit more detail regarding the movzx? Aren't 32-bit registers always zero-extended on x86-64?

Sure. Here's an in-depth explanation from Fabian Giesen: https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759...

Thanks, rygorous is always a great read - although sometimes a little overwhelming. If I got the gist of it, I have a small correction to your comment: the issue is about movsxd (sign extended integer indexes), not movzx (zero extension).

> It's easy to see why compiler developers added these optimizations: someone filed a bug saying "hey, why is my loop full of movsx", and the developers fixed the problem.

"fixed" by breaking other expectations. Regardless of what the spec says, that's still a stupid way to do things. There's a child comment below which examines this case in detail; and the real solution is to make the analysis better, not use UB as a catch-all excuse.

> compiler writers are basically taking the standard as gospel

I would be rather disappointed if they didn't, honestly.

Consider the following statements:

1) The standard says I must do this, so I must do it.

2) The standard doesn't say I must not do this (but does allow me to either do it or not do it), so it's totally OK if I do it.

I think you're thinking of cases covered by statement 1, and I think pretty much everyone agrees that compiler writers should behave that way for the standard to mean anything.

The issues arise in cases covered by statement 2. Just because the standard allows a behavior doesn't mean that the behavior is a good one. And yes, code relying on you not having the behavior is not following the standard, and that's something the authors of that code should consider addressing. But on the other hand, the standard may allow a lot of behaviors that only make sense in some situations but not others (totally true of the C standard, depending on the underlying hardware) and as a compiler writer you should think carefully about what behaviors you actually want to implement.

As a concrete example, you _could_ write a C compiler targeting x86-64 which has sizeof(uint64_t) == 1, sizeof(unsigned int) == 1, sizeof(unsigned long) == 2, and sizeof(unsigned long long) == 2 (so 64-bit char, 64-bit short, 64-bit int, 128-bit long, 128-bit long long). Would this be a good idea? Probably not, unless you are trying to use it as a way to test for bugs in code that you will want to run on an architecture where those sizes would actually make sense...
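
To make the "allowed but probably unwise" point concrete, here is a rough sketch that checks a set of type widths against the minimum widths the C standard requires (char at least 8 bits, short and int at least 16, long at least 32, long long at least 64). The helper name and its parameters are hypothetical, and it assumes types with no padding bits, which the standard does not require. It is a simplification, not a full conformance check.

```c
#include <stdbool.h>

/* Hypothetical helper: all widths are in bits, padding bits assumed absent.
   Returns true if the layout meets the standard's minimum widths, keeps the
   integer types in non-decreasing order of width, and makes every type a
   whole number of chars. */
bool layout_meets_c_minimums(int char_bits, int short_bits, int int_bits,
                             int long_bits, int llong_bits) {
    bool minimums = char_bits >= 8 && short_bits >= 16 && int_bits >= 16 &&
                    long_bits >= 32 && llong_bits >= 64;
    bool ordered = char_bits <= short_bits && short_bits <= int_bits &&
                   int_bits <= long_bits && long_bits <= llong_bits;
    bool whole_chars = minimums &&
                       short_bits % char_bits == 0 &&
                       int_bits % char_bits == 0 &&
                       long_bits % char_bits == 0 &&
                       llong_bits % char_bits == 0;
    return minimums && ordered && whole_chars;
}
```

Under these assumptions, `layout_meets_c_minimums(64, 64, 64, 128, 128)` passes just like the usual x86-64 layout `layout_meets_c_minimums(8, 16, 32, 64, 64)` does, which is the point: the standard permits both, and only one of them is a sensible choice for that hardware.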

It's a collective action problem. If we want to give up runtime performance and get stronger guarantees about what code will be understood to mean, we should revise the standard and start using new optimizers that respect it. If every compiler goes its own way, I only benefit from what they already agreed on.