Hacker News
by adwn 1069 days ago
> the C representation closely matches what is actually happening

It really doesn't, though. Although your CPU might present system RAM as one contiguous array of bytes to your program, the C compiler follows different rules – see strict aliasing and other pointer dereference rules. For example, the following is Undefined Behavior and your C compiler may or may not generate the assembly you expect:

    int x = *(int *)0x12345678;
Your CPU would happily execute the equivalent machine instructions and load from address 0x12345678, while a C compiler is free to replace your entire program with return 0;

Casting an integer to a pointer is implementation defined, not UB.

And every sane implementation does what everyone expects, because it's how memory-mapped I/O works (but you probably want a volatile in there, and maybe a compiler or memory barrier as well, depending on what the hardware guarantees about the access patterns for that particular range of addresses).
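
A minimal sketch of the memory-mapped I/O pattern described above, assuming an implementation where converting an integer to a pointer yields the address you expect. The helper names mmio_read32 and mmio_write32 are hypothetical, the natural-alignment check is an added assumption, and any barriers the hardware may need are not shown:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read a 32-bit register at a numeric address.
   The volatile qualifier makes the load an observable access, so the
   compiler may not drop it or reuse an earlier value. */
static inline uint32_t mmio_read32(uintptr_t addr) {
    assert(addr % sizeof(uint32_t) == 0); /* assumption: registers are naturally aligned */
    return *(volatile uint32_t *)addr;
}

/* Hypothetical helper: write a 32-bit register at a numeric address. */
static inline void mmio_write32(uintptr_t addr, uint32_t value) {
    assert(addr % sizeof(uint32_t) == 0);
    *(volatile uint32_t *)addr = value;
}
```

Note that volatile only constrains the accesses themselves; it does not order them against ordinary loads and stores, which is where the barriers mentioned above would come in.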

> Casting an integer to a pointer is implementation defined, not UB.

You're right, that was a bad example. Here's a better one:

    int x, y;
    ptrdiff_t diff = &x - &y;
This is Undefined Behavior, because &x and &y don't point to the same object.
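
For contrast, a rough sketch of two ways to get a pointer distance without that UB. The function names are made up for illustration, and the second one is only implementation-defined, not portable: it assumes a flat address space where the integer values of pointers behave like byte addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Well defined: both pointers must point into (or one past the end of)
   the same array object. */
static ptrdiff_t index_of(const int *base, const int *elem) {
    assert(base != NULL && elem != NULL);
    return elem - base;
}

/* Implementation-defined rather than undefined: convert both pointers to
   integers first and subtract those. On flat-address-space targets this is
   the distance in bytes; the standard does not promise that. */
static intptr_t address_gap(const void *a, const void *b) {
    return (intptr_t)((uintptr_t)a - (uintptr_t)b);
}
```
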
The original author was talking about hardware not behaving like linear memory, and other than caches and maybe some thread-local tricks, I'm not sure what he meant. However, it seems pretty clear that CPUs do try really hard to make:

    mov rax, qword ptr [0x12345678]
do what you think it would/should.

And as for the C memory model, aliasing, and optimizations, I'm firmly in the camp that thinks the standards originally gave compiler writers an inch to work on weird platforms, and they've taken a mile on reasonable ones. The intent of your integer-to-pointer cast is very clear, but it's been undefined to insanity. So now there is some variant of the following, which doesn't have UB but does the exact same thing less clearly:

    uintptr_t i = 0x12345678;
    int* p = 0;
    memcpy(&p, &i, sizeof(int*));
    int x = *p;
I'm sure some language lawyer will correct me on some obscure detail of the standard, but it could be fixed with some modification. The point to me is that using memcpy instead of pointer casts is NOT an improvement. The good compilers will generate the same code as the assembly above, so all they've done is made the C source less readable.
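
For reference, the memcpy idiom being argued about is most often seen when reinterpreting an object's bytes as another type. A small sketch, assuming float is a 32-bit IEEE 754 value (the name float_bits is made up here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a float into a same-sized integer.
   No pointer of a mismatched type is ever dereferenced, so the aliasing
   rules are not involved. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    static_assert(sizeof bits == sizeof f, "assumes a 32-bit float");
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Mainstream optimizing compilers typically turn a fixed-size memcpy like this into a plain register move, which is the "same code" observation made above.
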
> The point to me is that using memcpy instead of pointer casts is NOT an improvement.

The improvement comes when there are multiple accesses that could potentially point to the same memory. Consider a silly function:

    void f(int16_t* a, int32_t* b) {
      for (int32_t i = 0; i < 100; i++) {
        b[i] = a[0] + i;
      }
    }
If type-based alias analysis is enabled, then the compiler can assume that a[0] does not alias b[i] because they are different pointer types. So it can hoist the load of a[0] outside the loop, improving efficiency. If strict aliasing is disabled, it cannot assume this, so it must reload a[0] each time: https://godbolt.org/z/E7jxfYsbx

The memcpy() makes it clear that the memory could alias anything, so it will generate the less efficient code even if strict aliasing is enabled: https://godbolt.org/z/KoPxK9fPj

Memory aliasing is a huge thorn in the side of the optimizer, because the compiler frequently has to allow for the possibility that different pointers will alias each other, even if they never will in practice. The code might end up being slower than necessary for no real reason. Strict aliasing is one of the few tools we have to tell the compiler that aliasing will not occur.

I don't think that C actually forbids this code:

     *(int*)0x12345678
The rule is just: if you access it as an int, you have to consistently access it as an int. You can't mix types from one access to the next, e.g.:

    *(long*)0x12345678
    *(int*)0x12345678
> Strict aliasing is one of the few tools we have to tell the compiler that aliasing will not occur.

I can see the argument, but there's a much better way to indicate what you want with your example:

    void f(int16_t* a, int32_t* b) {
      const int16_t a0 = a[0];
      for (int32_t i = 0; i < 100; i++) {
        b[i] = a0 + i;
      }
    }
Now a clean (well defined) compiler could do what you asked.

I've seen other people suggest that UB is a mechanism to have these magical backdoor conversations with the compiler to express optimization opportunities. I think that's absurd and reckless. Propose adding assertions or "declare" statements instead, and quit thinking of interpretive dance through a minefield as a method of communication.

You are entitled to your opinion. C isn't perfect, but as someone who spends my life trying to optimize the efficiency and code size of critical loops to the max, I like the direction C has gone with UB and optimizations. It's not the right tool for every problem, but for the most size/speed critical code it's hard to beat IMO.
> I don't think that C actually forbids this code:

     *(int*)0x12345678
If not, give it time. It was only a few years ago when you were allowed to use a union for that kind of thing. I really believe they'll eventually make everything except unsigned integers be UB.

"Oh, the code was never correct. You just got lucky before."

If you want to load from address 0x12345678, assign it to a char pointer first. Then the cast is legal and defined.

Your point that C is stricter than asm of course still stands.
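
One reading of the char-pointer remark is that character types are the exception to the aliasing rules: any object's bytes may be inspected through unsigned char *. A hedged sketch of that direction, with made-up helper names; it does not claim that casting a char pointer back to int * is enough on its own, and sidesteps that question by using memcpy for the return trip:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reading the bytes of any object through unsigned char * is allowed. */
static unsigned char byte_at(const void *obj, size_t i) {
    assert(obj != NULL);
    const unsigned char *bytes = obj;
    return bytes[i];
}

/* Going from raw bytes back to an int: memcpy avoids depending on the
   buffer's alignment or effective type. */
static int load_int(const unsigned char *buf) {
    int value;
    memcpy(&value, buf, sizeof value);
    return value;
}
```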

> and load from address 0x12345678

and most likely seg fault, or similar

1. If the CPU lacks an MMU and the address falls into an accessible address space, it won't segfault.

2. If the CPU has an MMU, it won't segfault if the address is mapped to an accessible region of memory.

3. This is beside the point, because the CPU will execute the instruction and attempt to load from that address. A C compiler might emit the load instruction, or it might assume that this code branch will never be executed and can therefore be replaced with code that sends an angry email to your mother.