by trumpdong 5 days ago
There's a lot of UB in C-family execution models, some of which is not actually UB because the implementation defines it - e.g. aligned DWORD-sized memory access is atomic on Windows because Microsoft said it is.

By choosing to use this language you choose to navigate the UB. Otherwise you'd be writing in Go, or Python.

It is possible to write reliable code despite the presence of UB in a language, just like it's possible to drive to work every day for 20 years despite most of the directions you can point the car leading to an immediate crash. That's a needle with a much thinner eye than UB in C, and most people manage it. Mainly it means being very careful about lifetime and ownership. The Linux kernel manages it 99% of the time simply by being careful about lifetime and ownership, and that's a project with a huge number of contributors who don't intimately know each other's modules. In the Linux kernel you can't just say "new whatever" - you must have a plan for the lifetime of that whatever, and other people will review it.

I agree with you about std::span.

3 comments

Yeah but also, quick question:

  struct S {
      char c;
      int i;
  };

  struct S a = {0};
  struct S b = {0};

  memcmp(&a, &b, sizeof(a)) == ...

If you answered 0, you'd be wrong, the answer is undefined, thanks to padding, initialization and alignment rules. Padding bytes are undefined, and not guaranteed to be initialized to zero even if the variable is declared static (where the members would be zeroed).
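A minimal sketch of the portable alternative, assuming the same struct S as in the question: compare the members rather than the object representation, so the padding bytes are never read. The helper name s_equal is made up for illustration.

```c
#include <stdbool.h>

struct S {
    char c;
    int i;
};

/* Hypothetical helper: compares members only.
   Padding bytes between c and i are never inspected. */
static bool s_equal(const struct S *x, const struct S *y)
{
    return x->c == y->c && x->i == y->i;
}
```

With a and b initialized as in the question, s_equal(&a, &b) is true no matter what the padding happens to hold.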

This is why the compiler is angry at the post writer, and why the reinterpret_cast is needed. Ideally if they wanted to do something with the data, they'd unbox the structure.

That's why it's not a good idea to use void* to pass arbitrary data interchangeable with bytes. It's a location, it makes no representation as to what's there and how to interact with it. Let alone who owns it.
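One way to read "unbox the structure", sketched in C under some assumptions: copy each member into a byte buffer explicitly instead of handing out a pointer to the whole object. The names s_pack and S_PACKED_SIZE are hypothetical, and the sketch keeps native byte order, so it only suits data that stays within one process or ABI.

```c
#include <stddef.h>
#include <string.h>

struct S {
    char c;
    int i;
};

/* Bytes produced by s_pack: the members only, no padding. */
enum { S_PACKED_SIZE = sizeof(char) + sizeof(int) };

/* Hypothetical helper: writes the members of *s back to back into out.
   out must have room for S_PACKED_SIZE bytes.
   Returns the number of bytes written. */
static size_t s_pack(const struct S *s, unsigned char *out)
{
    size_t n = 0;

    memcpy(out + n, &s->c, sizeof s->c);
    n += sizeof s->c;
    memcpy(out + n, &s->i, sizeof s->i);
    n += sizeof s->i;

    return n;
}
```

Two structs with equal members then produce buffers that memcmp can compare safely, because the padding never makes it into the buffer.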

std::span solves two problems here. One is the ownership problem. The other is that span<T> is a T[]. void* is god only knows.

The post asserts:

> The code is very clear and straightforward: you pass a pointer to the custom data structure, and its size in bytes. That’s it. Simple and clear.

This is unfortunately entirely false in C thanks to the aforementioned alignment/padding UB (and of course inner pointers). This is addressed with std::span. You'd still have to reinterpret_cast your structure to get the UB.

> Why should people complexify and uglify their C++ code with the uint8_t pointer (or std::byte), when void* works just fine??

tl;dr: because it doesn't. It just kinda looks like it does if you squint, and it's going to lead to the gnarliest bugs in the world.

> even if the variable is declared static

No, for static even padding bytes are zero.

For automatic, yes it may effectively turn a = {} to a.member = 0, leaving the padding bytes uninitialised. Or on copies like a = b it may not copy padding bytes.

Padding bytes are initialized to zero if you zero initialize the aggregate. It is hard to keep those bytes as zero but at initialization this much is guaranteed.

I looked into it some more and it's actually worse.

For static or thread storage, in C11 and later, ={0} will guarantee padding is zeroed. For automatic storage, per C11 6.7.9, only subobjects are required to be zeroed. Padding is not. [1]

In C23 initializing with ={} will give you zeroed padding, initializing with ={0} will not.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf

> Some of which is not actually UB because the implementation defines it

No - if something is UB in the spec, it's UB. The implementation will do something, sure, but what it does is not fixed and may even change based on compiler version and optimization level.

> DWORD-sized memory access is atomic on Windows because Microsoft said it is

Well, Intel said it is. Mind you I don't think there are any 32-bit native architectures where aligned dword access isn't atomic. Unaligned, on the other hand ...
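For comparison, a sketch of how the same guarantee can be requested from the language instead of the platform, assuming a C11 compiler that ships <stdatomic.h>. The variable and function names are illustrative only.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared 32-bit value. */
static _Atomic uint32_t shared_value;

/* Store without tearing; relaxed ordering gives atomicity only,
   with no ordering against other memory operations. */
static void publish(uint32_t v)
{
    atomic_store_explicit(&shared_value, v, memory_order_relaxed);
}

/* Load without tearing. */
static uint32_t observe(void)
{
    return atomic_load_explicit(&shared_value, memory_order_relaxed);
}
```

On x86 a compiler will typically turn these into plain aligned loads and stores, but the atomicity is then promised by the C standard rather than inferred from the hardware manual.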

"Undefined behavior" in the C standard literally means "behavior which this C standard does not put any requirements on" - it says so in the definitions section of the C standard. Other things can still put requirements on it. MSVC isn't just a C++ compiler - it's a C++ compiler for x64 Windows and therefore follows the rules of C++, x64, and Windows all at once.
> No - if something is UB in the spec, it's UB.

A compiler is still free to ignore the spec and declare that something is not UB. However, this is very much compiler based, not platform based. Windows might guarantee that aligned DWORD-sized memory accesses are atomic, but that doesn't mean Clang when compiling for Windows would respect this - but MSVC might.

No, a compiler obviously cannot do this. Nothing is undefined behaviour under a known compiler, version, and settings. UB means you can't know what the code does in general, not that you can't know what it does in a very specific case.

UB has 2 very different implications:

1. It means that even if your program happens to work, it can't be portable

2. It means that even if your program happens to work today, it might stop working tomorrow when you add some new code, when you change some compiler flags, or when you do even a minor compiler upgrade

Of course, a compiler can't address 1. However, a compiler can very much address item 2. If Microsoft were to say "in MSVC, we define integer overflow to wrap", then they would guarantee that `INT_MAX + 1` will produce `INT_MIN` regardless of any optimization settings, any compiler upgrades, any other changes to the code. Of course, compiling the exact same program with Clang or GCC might cause it to crash or corrupt memory or anything else - but as long as you stuck with MSVC, your program would have perfectly defined semantics.

This is similar to using compiler extensions or intrinsics - they are not portable and not defined by the standard, maybe even explicitly defined to NOT be supported per the standard (such as variable length arrays in C++ in GCC), but they are nevertheless perfectly safe as long as you stick to your chosen compiler.

Edit to add: the integer overflow example is not just a theoretical possibility - GCC and Clang provide the `-fwrapv` flag; when using that flag, signed integer overflow is no longer UB for that program, it is defined to wrap around just like unsigned integer overflow.

There is a difference between UB in C, and something being undefined in some version of Microsoft C on Windows.

Much of C's UB is specifically, intentionally left undefined in the standard to express that code relying on some specific way it is handled is not proper, portable C. Indeed, the DWORD-sized memory access being atomic doesn't apply to MS Windows prior to version 3.0 running on an 80286.

It's UB because the ISO C spec says it's UB.