Shared memory is way outside the scope of standard C or C++; it's implementation-defined. It's inconsistent to insist on the weakest definition of atomics allowed by the C/C++ standards while simultaneously invoking one of the weirdest implementation-defined mechanisms defined by POSIX. If your implementation provides shared memory of some kind, it's up to your implementation to define some sort of reasonable semantics for it.
In POSIX's case, it's up to POSIX operating systems to define reasonable semantics for that memory, using constructs like PTHREAD_PROCESS_SHARED and "robust" pthread mutexes.
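For anyone who hasn't used those two constructs, here is a minimal sketch of a process-shared, robust mutex living in shared memory. `SharedCounter`, `create_shared_counter`, `locked_increment`, and `fork_demo` are names made up for illustration, and `MAP_ANONYMOUS` is assumed to be available (it is on Linux and the BSDs). Error handling is kept to the bare minimum.

```cpp
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <new>

// Hypothetical layout: a mutex plus the data it protects, both in shared memory.
struct SharedCounter {
    pthread_mutex_t lock;
    int value;
};

// Map one SharedCounter that stays shared with children created by fork().
SharedCounter* create_shared_counter() {
    void* mem = mmap(nullptr, sizeof(SharedCounter), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    SharedCounter* c = new (mem) SharedCounter;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // Allow the mutex to be used by any process that maps this memory.
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    // Make lock() report EOWNERDEAD if the previous owner died while holding it.
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    int rc = pthread_mutex_init(&c->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(mem, sizeof(SharedCounter));
        return nullptr;
    }
    c->value = 0;
    return c;
}

// Increment under the lock; returns the new value, or -1 on error.
int locked_increment(SharedCounter* c) {
    int rc = pthread_mutex_lock(&c->lock);
    if (rc == EOWNERDEAD) {
        // The previous owner died mid-update. A real program would repair
        // the protected state here before declaring it consistent again.
        pthread_mutex_consistent(&c->lock);
    } else if (rc != 0) {
        return -1;
    }
    int v = ++c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}

// Child and parent each increment once through the same mapping.
int fork_demo() {
    SharedCounter* c = create_shared_counter();
    if (c == nullptr) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        locked_increment(c);
        _exit(0);
    }
    waitpid(pid, nullptr, 0);
    int v = locked_increment(c);
    munmap(c, sizeof(SharedCounter));
    return v;
}
```

Note that everything giving this meaning across processes comes from POSIX, not from the C++ standard. On older glibc versions you may need to build with `-pthread`.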
> C++ atomics are no good here, because they are not guaranteed to be lock free or address free.
That's not right; you can still use std::memory_order to get the required memory barriers generated. Those are obviously going to be lock free: they deal with memory ordering, which is what you tried to handle with volatile, but in the general case.
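To make the memory-ordering point concrete, here is a small sketch of a release/acquire handoff between two threads. The names (`payload`, `ready`, `producer`, `consumer`, `run_handoff`) are invented for the example, and it is a single-process illustration only; whether a given atomic type is lock free on your platform is something you can query with `is_lock_free()`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared state: plain data plus an atomic flag that publishes it.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish: earlier writes
                                                   // become visible to an acquirer
}

int consumer() {
    // Spin until the flag is observed; acquire pairs with the release store.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // guaranteed to see the value written before the store
}

// Runs one handoff and returns what the consumer observed.
int run_handoff() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}

// Reports whether the flag is lock free on this implementation.
bool flag_is_lock_free() {
    return ready.is_lock_free();
}
```

A volatile flag would stop the compiler from caching the load, but it would not give the ordering guarantee between `payload` and `ready` that the release/acquire pair does.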
This is probably not useful for production, but volatile is a great way to see what kind of code the compiler generates in a realistic setting. For example, you may want to see how the compiler optimizes a snippet that depends on a constant you don't want constant-folded away.
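A tiny sketch of that trick, with made-up names (`seed`, `scale`, `folded`, `not_folded`). Compile it with optimizations on and compare the assembly of the two callers; the exact output depends on your compiler and flags.

```cpp
// Hypothetical input: volatile forces a real load each time it is read.
volatile int seed = 7;

// The code whose generated assembly we want to inspect.
int scale(int x) {
    return x * 3 + 1;
}

// With a literal argument, an optimizing compiler will typically reduce
// this whole function to returning a constant.
int folded() {
    return scale(7);
}

// Reading through the volatile means the compiler cannot assume the value,
// so the arithmetic in scale() has to survive into the generated code.
int not_folded() {
    int s = seed;
    return scale(s);
}
```

Both functions compute the same value at run time; the difference only shows up in what the optimizer is allowed to precompute.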
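That is a reasonable heuristic, but your statement is not technically correct. E.g. you need volatile around setjmp/longjmp (a non-volatile local modified between the setjmp and the longjmp has an indeterminate value afterwards), and that has nothing to do with IO.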
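A minimal sketch of the setjmp/longjmp case, using invented names (`env`, `fail`, `count_attempts`). The local is modified after `setjmp` and read after `longjmp` returns control, which is exactly the situation where the standard requires volatile for the value to be reliable.

```cpp
#include <csetjmp>

// Hypothetical jump target shared by the two functions below.
static std::jmp_buf env;

// Simulates an error path that unwinds back to the setjmp call.
static void fail() {
    std::longjmp(env, 1);
}

int count_attempts() {
    // Without volatile, the compiler may keep this in a register and the
    // value read after longjmp would be indeterminate.
    volatile int attempts = 0;

    if (setjmp(env) != 0) {
        // We got here via longjmp; attempts reflects the update made below.
        return attempts;
    }

    attempts = attempts + 1;  // modified between setjmp and longjmp
    fail();
    return -1;  // not reached
}
```

Only trivially destructible locals are in scope here on purpose: in C++, a longjmp that would skip non-trivial destructors is undefined behavior.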
You also need it if your GC doesn't dump the registers. Only with volatile can you keep all locals on the stack.
And yes, that's not stupid. It's actually faster than all the register "optimizations" for practical use cases in fast VMs. Saving registers across calls and at GC time is much more expensive. mem2reg is mostly an antipattern.
C++ atomics are no good here, because they are not guaranteed to be lock free or address free.