by temac 1414 days ago
> it’s UB to take the address of a parameter via any mechanism other than explicitly taking the address - which a compiler obviously sees, and because it’s UB the compiler is free to assume no one is taking the address.

I don't get that: can you write C++ code that would "take the address of a parameter via any mechanism other than explicitly taking the address"?

Consider this C code (also "works" if compiled as C++):

    int main(void) {
        int x = 0;
        int arr[1];
        int *p = arr + 1;
        *p = 42;
        return x;
    }

On a lot of systems (e.g., https://godbolt.org/z/jYqM8TT3Y), it just so happens that `x` is right above `arr` on the stack, so that code will return 42. But that code is absolutely UB.

The more general name for this concept is "pointer provenance". Basically, you can't pull pointer values out of thin air; you have to derive them from operations rooted at taking the address of something within the same allocation.

That's a buffer overflow. The optimizer doesn't need to reason about changing the behavior of such things.

That’s the point - it’s UB to go off the end of an array, or more generally to dereference a pointer outside of the bounds of the target object (yes, buffer+buffer_length is a valid pointer for the purpose of comparisons, but dereferencing it would be UB).
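To make the buffer+buffer_length remark concrete, here's a minimal sketch (the name `sum_ints` is mine, purely for illustration): the one-past-the-end pointer is formed and compared against, but never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums `len` ints starting at `buf`. */
int sum_ints(const int *buf, size_t len) {
    assert(buf != NULL);
    const int *end = buf + len;   /* one past the end: fine to form and compare */
    int total = 0;
    for (const int *p = buf; p != end; ++p) {
        total += *p;              /* p has not reached end, so it is in bounds */
    }
    /* Reading *end here would be UB, even though end is a valid pointer value. */
    return total;
}
```

Calling `sum_ints(a, 3)` on `int a[3] = {1, 2, 3}` stays entirely inside the array; the only thing that ever happens to `end` is the `!=` comparison.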

However in practice you can do this, and walk the stack to find the parameters or what have you, and then you’ve got a pointer to a parameter without the compiler being aware of it. But this is all explicitly UB, so it’s ok for the compiler to be unaware of it, and it’s free to do whatever codegen it wants given the assumption that UB can never happen.

The point is that on systems where that code returns 42, `p` has the exact same value it would if I did `int *p = &x;` instead, but not the same provenance.

And because C++ says "one past the end" pointers are a thing, both these pointers can exist.

As written p is a one-past-the-end pointer into object arr, but the address one past the end of arr may well be the address of x. If pointers are just addresses, these pointers are the same... right?

Neither C nor C++ currently actually explains how this works for its "abstract machine" in the standards documents. The reality is that your C++ compilers (and any non-toy C compilers) have pointer provenance, because it's a nightmare to optimise C programs without it. But since it isn't documented anywhere (my understanding is that C23 might fix this for C by adopting a TS, and an equivalent fix via P2318 could land in C++26), it's difficult to say whether anything odd you find in their behaviour is actually a bug.

> And because C++ says "one past the end" pointers are a thing, both these pointers can exist.

While one-past-the-end pointers are allowed to exist, they are not allowed to be dereferenced.

> these pointers are the same... right

The entire point of provenance is that even though their numerical values are the same, they are not the same.
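A small sketch of the "same value, different provenance" situation, assuming nothing about stack layout (the function name `addresses_coincide` is made up for this comment). It only compares the two addresses as integers and never dereferences the one-past-the-end pointer, so it stays within defined behaviour. Whether it returns 1 or 0 depends on the compiler, flags and target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the address one past the end of arr
   happens to equal the address of x in this build, 0 otherwise. */
int addresses_coincide(void) {
    int x = 0;
    int arr[1] = {0};

    int *past_end = arr + 1;  /* provenance: arr (one past the end) */
    int *to_x = &x;           /* provenance: x */

    /* Reading through to_x is fine: it was derived from &x. */
    assert(*to_x == 0);

    /* Comparing the numeric addresses is fine too. Dereferencing past_end
       would be UB regardless of what this comparison says. */
    return (uintptr_t)(void *)past_end == (uintptr_t)(void *)to_x;
}
```

Even on a build where this returns 1, `*past_end = 42;` is still UB: the optimizer is entitled to assume that a write through a pointer derived from `arr` cannot modify `x`.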

> Neither C nor C++ currently actually explain how this works for their "abstract machine" in the standards documents.

While it isn't mentioned explicitly, it can be inferred from other things that the standard does say. The compiler authors didn't just make it up.

> it can be inferred from other things that the standard does say. The compiler authors didn't just make it up.

They didn't "just make it up", but here's (a draft of) TS 6010 explaining where it comes from. Alas, it's not "inferred from other things that the standard does say" but rather riffing on a phrase from a discussion about a defect report...

"In a committee discussion from 2004 concerning DR260, WG14 confirmed the concept of provenance of pointers, introduced as means to track and distinguish pointer values that represent storage instances with same address but non-overlapping lifetimes. Implementations started to use that concept, in optimisations relying on provenance-based alias analysis, without it ever being clearly or formally defined, and without it being integrated consistently with the rest of the C standard."

TS 6010 will, some day, actually define how this works. Well, it will define how it should work, and assuming compiler vendors can be bothered to implement TS 6010 then it becomes how it actually works.

In TS 6010 (which again, is not how your C or C++ compiler works today, and in the best case won't be how your C++ compiler is required to work until at least 2027 or so) the rules go roughly like this:

* If you've got an actual pointer to a living object via some legitimate means, e.g. you used the & operator in C, that works

* If you try to make pointers from somewhere else, e.g. doing arithmetic on pointers that point to a different object, this only works if you've previously done some operation which might cause non-pointer stuff to be aware of this pointer, e.g. you cast a pointer to an integral type or you type-punned a pointer and then looked at the bytes

* However, the compiler is obliged to give you the benefit of the doubt about pointer types. If it's possible you knew how to make a Doodad* with this address in it, then the fact that you also knew how to make a Foo* with the same address doesn't matter; your program is allowed to make a Doodad* rather than a Foo* if it wants

Thus, your example up thread is still Undefined Behaviour under TS 6010, because you've got no reason to believe the memory layout is the way it actually was. But if you use some type punning hack to get the address of x into that pointer instead, TS 6010 says that works and is not Undefined Behaviour.
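Here's a rough sketch of the kind of thing I mean by getting the address of x into the pointer by another route, using the simplest version: a round trip through `uintptr_t`. The function name is hypothetical, and this is my reading of the "cast a pointer to an integral type" rule above, not text from the TS. As far as I know this particular round trip is also fine under the current C standard wherever `uintptr_t` exists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: writes to x through a pointer rebuilt from an integer. */
int write_via_integer_round_trip(void) {
    int x = 0;

    /* Casting &x to an integer makes x's address visible to non-pointer code. */
    uintptr_t bits = (uintptr_t)(void *)&x;

    /* Rebuild a pointer from the integer; it refers to x again. */
    int *p = (int *)(void *)bits;
    assert(p == &x);

    *p = 42;   /* a legitimate write to x, unlike the arr + 1 example */
    return x;
}
```

The difference from the `arr + 1` version is that nothing here relies on guessing where `x` lives: the address came from `&x` in the first place, it just took a detour through an integer.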

Imagine the parameter is passed on the stack. I can take the address of a local variable (which in general forces it to be on the stack), and then walk up the stack from that address to where the parameter is.

This is undefined behavior. Because it is UB, the compiler is allowed to assume it cannot happen. So I now have the address of a parameter and can pass it to a closure or whatever: the parameter has escaped, but the compiler doesn't know.

More importantly by definition the compiler does not need to know, because a program that does that is no longer well defined.
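For contrast, a minimal sketch of the one route the compiler does have to honour: explicitly taking the parameter's address. The names `bump` and `with_visible_escape` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callee that modifies whatever it is pointed at. */
static void bump(int *p) {
    assert(p != NULL);
    *p += 1;
}

/* The compiler sees &n right here, so it knows n may be changed by bump
   and has to read n again after the call. */
int with_visible_escape(int n) {
    bump(&n);
    return n;
}
```

If the `&n` weren't there, the compiler could keep `n` in a register for the whole function and never give it a stack slot at all, which is exactly why the stack-walking trick can't be relied on.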