by dmcdm 2540 days ago
I found it an amusing exercise, if not terribly relevant, even as someone who spends 90% of his dev time in C.

What rubs me about these sorts of articles is they make some presumption about the importance and necessity of writing truly portable C, as if the "C Standard" were in and of itself a terribly useful tool. This is in contrast to where I live most of the time which is "GCC as an assembler macro language" (for a popular exposition on this subject see https://raphlinus.github.io/programming/rust/2018/08/17/unde...). And yeah, reading through the problem set I was critiquing it in the context of my shop's standards, where we might be packing and padding, using cacheline alignment, static assertions about sizeof things, specific integer types, etc. So these sorts of articles just come off as a little pedantic to folks like me. I don't doubt they're useful for some folks, and I guess it's interesting to come up from the depths of non-standard GNU extensions and march= flags to see what I take for granted.
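For readers who haven't worked in that style, here is a rough sketch of what "static assertions about sizeof things" and cacheline alignment can look like in C11. The struct names, the field layout, and the 64-byte cache line are all assumptions made up for illustration, not anything from the parent's shop.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas, alignof (C11) */
#include <stddef.h>    /* offsetof */
#include <stdint.h>    /* fixed-width integer types */

/* Assumption: the target has 64-byte cache lines. */
#define CACHELINE_SIZE 64

/* Hypothetical header with explicit padding, so the layout has no
   compiler-chosen holes. */
struct packet_header {
    uint32_t sequence;
    uint16_t length;
    uint8_t  flags;
    uint8_t  reserved;  /* explicit pad byte */
};

/* Fail the build, not the deployment, if the layout ever drifts. */
static_assert(sizeof(struct packet_header) == 8,
              "packet_header must be exactly 8 bytes");
static_assert(offsetof(struct packet_header, length) == 4,
              "length must sit at byte offset 4");

/* Hypothetical per-CPU counter, padded out to one cache line so that
   neighbouring counters do not share a line. */
struct percpu_counter {
    alignas(CACHELINE_SIZE) uint64_t value;
};

static_assert(alignof(struct percpu_counter) == CACHELINE_SIZE,
              "percpu_counter must be cache-line aligned");
static_assert(sizeof(struct percpu_counter) == CACHELINE_SIZE,
              "percpu_counter must occupy one full cache line");
```

If any of these assumptions is wrong for a given target, the translation unit simply fails to compile, which is the whole point of writing them down.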

6 comments

It's very much worth reading, Linus Torvalds' opinion of standards that's linked in that article, but I'll link it again here: https://lkml.org/lkml/2018/6/5/769

"So standards are not some kind of holy book that has to be revered. Standards too need to be questioned."

The way I see it, a lot of compiler writers are basically taking the standard as gospel and ignoring everything else "because the standard doesn't say we can't" --- and that's a huge problem, because behaviour that the standard doesn't define often has a far more common-sense meaning that programmers expect. IMHO the onus should really be on the authors of compilers to find that reasonable meaning. In fact, the standard even suggests that one possible undefined behaviour is something like "behave in a manner characteristic of the environment" (can't remember nor be bothered looking up the standard.)

This is a common misconception. Compiler authors don't exploit undefined behavior to make themselves seem smart, or because they like breaking code. They exploit undefined behavior because somebody filed a bug saying some code was slow, and exploiting UB was the simplest way--or, in many cases, the only way--to fix the performance problem.

GCC and Clang do give you the option to avoid optimizations based on undefined behavior: compile at -O0. We think of the low-level nature of C as being good for optimization, but in many cases the C language as people expect it to work is at odds with fast code.

It's fascinating to actually dive into the specific instances of undefined behavior exploitation that get the most complaints. In each such case, there is virtually always a good reason for it. For example, treating signed overflow of integers as UB is important to avoid polluting perfectly ordinary loops with movsx instructions everywhere on x86-64. It's easy to see why compiler developers added these optimizations: someone filed a bug saying "hey, why is my loop full of movsx", and the developers fixed the problem.

Edit: Should be movsx instead of movzx, sorry.
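To make the loop case concrete, here is a minimal sketch of the two index styles involved. The function names are made up for illustration, and whether a given compiler actually emits or avoids a sign extension for either one depends on the compiler version, flags, and target, so check the generated assembly rather than trusting this comment.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit signed index on a 64-bit target. Because overflow of i is
   undefined, the compiler is allowed to assume i never wraps and may
   keep it widened in a 64-bit register for the whole loop. */
int64_t sum_signed_index(const int32_t *data, int n)
{
    int64_t total = 0;
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Pointer-width index. No widening is needed here, whatever the
   overflow rules are. */
int64_t sum_size_index(const int32_t *data, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Both functions compute the same sum for any valid array. The difference is only in what the optimizer has to prove, or is permitted to assume, about the index.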

Could you go into a little bit more detail regarding the movzx? Aren't 32-bit registers always zero-extended on x86-64?
Sure. Here's an in-depth explanation from Fabian Giesen: https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759...
Thanks, rygorous is always a great read - although sometimes a little overwhelming. If I got the gist of it, I have a small correction to your comment: the issue is about movsxd (sign extended integer indexes), not movzx (zero extension).
> It's easy to see why compiler developers added these optimizations: someone filed a bug saying "hey, why is my loop full of movsx", and the developers fixed the problem.

"fixed" by breaking other expectations. Regardless of what the spec says, that's still a stupid way to do things. There's a child comment below which examines this case in detail; and the real solution is to make the analysis better, not use UB as a catch-all excuse.

> compiler writers are basically taking the standard as gospel

I would be rather disappointed if they didn't, honestly.

Consider the following statements:

1) The standard says I must do this, so I must do it.

2) The standard doesn't say I must not do this (but does allow me to either do it or not do it), so it's totally OK if I do it.

I think you're thinking of cases covered by statement 1, and I think pretty much everyone agrees that compiler writers should behave that way for the standard to mean anything.

The issues arise in cases covered by statement 2. Just because the standard allows a behavior doesn't mean that the behavior is a good one. And yes, code relying on you not having the behavior is not following the standard, and that's something the authors of that code should consider addressing. But on the other hand, the standard may allow a lot of behaviors that only make sense in some situations but not others (totally true of the C standard, depending on the underlying hardware) and as a compiler writer you should think carefully about what behaviors you actually want to implement.

As a concrete example, you _could_ write a C compiler targeting x86-64 which has sizeof(uint64_t) == 1, sizeof(unsigned int) == 1, sizeof(unsigned long) == 2, and sizeof(unsigned long long) == 2 (so 64-bit char, 64-bit short, 64-bit int, 128-bit long, 128-bit long long). Would this be a good idea? Probably not, unless you are trying to use it as a way to test for bugs in code that you will want to run on an architecture where those sizes would actually make sense...

It's a collective action problem. If we want to give up runtime performance and get stronger guarantees about what code will be understood to mean, we should revise the standard and start using new optimizers that respect it. If every compiler goes its own way, I only benefit from what they already agreed on.
GCC and many other compilers have been known to change the consequences of undefined behavior unpredictably when upgrading, changing compiler flags, etc. For some code, that matters.
Knowing what the standard says and keeping to it as much as possible is important because every now and then, a major compiler finds some exciting new way to optimise code based on undefined behaviour, and breaks code that assumed GCC would always do some seemingly obvious reasonable thing it did when the author tested it.
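A classic instance of the "seemingly obvious reasonable thing" is an overflow check written as `a + b < a` for signed ints. It relies on wraparound, the addition itself is undefined when it overflows, and an optimizer is permitted to drop the check. A sketch of a check that stays within defined behavior follows; the function name is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if a + b would overflow int. The test is done before
   any addition happens, so no undefined behavior is ever executed. */
bool add_would_overflow(int a, int b)
{
    if (b > 0) {
        return a > INT_MAX - b;  /* INT_MAX - b cannot overflow when b > 0 */
    }
    if (b < 0) {
        return a < INT_MIN - b;  /* INT_MIN - b cannot overflow when b < 0 */
    }
    return false;                /* adding zero never overflows */
}
```

Usage would be along the lines of `if (!add_would_overflow(a, b)) sum = a + b;`. GCC and Clang also provide `__builtin_add_overflow`, but that is a compiler extension rather than standard C.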
If you use C as an assembler macro language, you aren't actually writing C. You're likely to get burned someday, unless you compile at -O0.
> as if the "C Standard" were in and of itself a terribly useful tool

Not necessarily, I took it to mean that engineering is holistic and things like compiler behavior in the face of undefined parts of the standard are important to account for.

Where the author goes wrong is in assuming that somehow "I don't know" can be a final answer to these things. No, it is absolutely fucking vital that you know how the compiler will pad your structures in C. The same goes for "what size is an int" on your architecture: on an ATmega8 this is 16 bit, but the chip can't actually do all 16 bit operations in single instructions.
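One way to find out how a particular compiler pads a structure is to ask it, using sizeof and offsetof. The sketch below uses two made-up structs holding the same members in a different order. The exact numbers are implementation-defined, so treat any specific value you see as a property of your compiler and target, not of C.

```c
#include <stddef.h>
#include <stdint.h>

/* Same three members, two different orders (both hypothetical). */
struct loose {
    uint8_t  tag;
    uint32_t value;  /* typically preceded by padding for alignment */
    uint8_t  flag;
};

struct tight {
    uint32_t value;
    uint8_t  tag;
    uint8_t  flag;
};

/* Sum of the member sizes, identical for both structs. */
#define MEMBER_BYTES (sizeof(uint32_t) + 2 * sizeof(uint8_t))

/* Padding the compiler inserted: total size minus member sizes. */
size_t loose_padding_bytes(void)
{
    return sizeof(struct loose) - MEMBER_BYTES;
}

size_t tight_padding_bytes(void)
{
    return sizeof(struct tight) - MEMBER_BYTES;
}

/* Where the 32-bit member actually landed in the loose layout. */
size_t loose_value_offset(void)
{
    return offsetof(struct loose, value);
}
```

On common ABIs the reordered struct ends up smaller, but the only way to know for a given target is to compile and check, or to pin the result down with a static assertion.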
I took that to be the point of the article though, that just looking at the code wasn't enough to know and you needed to go further to answer these cases for your exact use case or target platform.
Further: Unless your code is compiled, deployed to a rocket, and fired off the Earth never to return, the question of “what is my platform?” is meaningless in the context of writing good C.

So, today, using the compiler installed on your system right now, sizeof(int) = 32. Great. That means nothing, and changes nothing about whether your code is correct. You should not write code relying on it. Just like you should not measure the output of the questions on this test, and declare that you know what the answers are.

> Unless your code is compiled, deployed to a rocket, and fired off the Earth never to return, the question of “what is my platform?” is meaningless in the context of writing good C.

While I feel the tone of your comparison was intended to be a bit hyperbolic, the reality is the bulk of modern C development occurs in a context similar to the one you describe. Further, the thought, utterly foreign to the vast majority of software developers, that the physical machine may not be some utterly abstract and constantly mutating target which there is no hope of understanding is, imo, one of the great dying arts of software engineering - a death perpetuated by the same sort of folks who think CS education should be carried on in Java.

I contend that, these days, most C is written to target a particular compiler, physical machine, and/or device.

There is vastly more old C code than new, and it didn't target the x64 or ARM architectures it's running on now. Where it wasn't portable, that was a defect that had to be fixed.

My first job was a 4GL targeting customers running DOS on the 80286, complete with runtime linking. 100% of that work has been abandoned due to incompatibility. It contributed nothing to the profession beyond what I personally learned.

There is a Mac program BBEdit that was first written to target 68K 32 bit Macs, then PPC 32 bit Macs, then 32 bit x86 Macs and then 64 bit Macs. Probably within the next 3 years it will target ARM Macs.

The author said he never did a full scale rewrite. He slowly migrated code from one platform to the next.

Today, Apple’s code runs on both ARM and x86 and, with Marzipan, so will developers' code. True, most will be in Objective-C, but some low level code is still in C.

I hope I'm being on topic and reasonable to point out that the result of the sizeof operator is in "number of chars", not bits.
This is why, decades ago, the C world moved on, and added types like int32_t and size_t, so programmers can say what they mean.
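To spell out the last two points: sizeof counts in units of char, so a bit count needs a multiplication by CHAR_BIT, and a fixed-width type lets the code state the width it needs instead of measuring it. A small sketch, with a made-up macro and function name:

```c
#include <assert.h>  /* static_assert (C11) */
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>
#include <stdint.h>  /* int32_t */

/* Storage size of a type in bits. sizeof yields a count of chars,
   and CHAR_BIT says how many bits one char holds. */
#define STORAGE_BITS(type) (sizeof(type) * CHAR_BIT)

/* True on every conforming implementation, by definition. */
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* If int32_t exists at all, it has exactly 32 bits and no padding. */
static_assert(STORAGE_BITS(int32_t) == 32, "int32_t occupies 32 bits");

/* Storage bits of plain int on whatever compiler builds this. The
   answer is implementation-defined, which is why code that needs a
   particular width should say so with a fixed-width type. */
size_t int_storage_bits(void)
{
    return STORAGE_BITS(int);
}
```

Note that `sizeof(int) * CHAR_BIT` is the storage size. The standard allows padding bits in int, so this is an upper bound on the number of value bits, although on mainstream targets the two coincide.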