I may be wrong, but I don't see this working in theory. The basic problem is that for any physical core, there is no guarantee that any writes will ever be seen by any other core (unless the necessary extra magic is done to make this so). (There's a matching problem when reading, too: in theory, writes made by another core may never be seen by the reading core, even if they have reached the domain covered by cache coherency, unless the necessary extra magic, etc.) So with this code, in theory, the writes being made may simply never be seen by another core, and so it doesn't work. In practice, writes typically make their way to other physical cores in a reasonably short time, whether or not the extra magic is performed; but this is not guaranteed. Also, of course, the order in which those writes become visible is not guaranteed (although some platforms provide additional guarantees; Intel is particularly generous in this regard, and performance suffers because of it).
They're using atomic reads/writes with sequential consistency.
This means that the compiler will automatically put memory barriers in the appropriate locations to guarantee sequential consistency, but probably at a significant performance cost.
The implementation is likely correct, but it's the slowest option available (seq-cst is the simplest / most brain-dead atomic ordering, but also the slowest, because you probably don't need all those memory barriers).
I'd expect that acquire-release ordering is possible for this use case, which would be faster on x86, POWER, and ARM systems.
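To make the acquire-release suggestion concrete, here is a minimal C++ sketch of a one-shot handoff between two threads. It is not the code under discussion: `payload`, `ready`, `producer`, `consumer`, and `run_handoff` are hypothetical names made up for illustration, and whether the real code can relax its ordering this way depends on details I can't see from here.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; none of this is from the original code.
int payload = 0;                  // plain, non-atomic data being handed over
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // write the data first
    // Release store: everything written above becomes visible to any
    // thread that later observes ready == true with an acquire load.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: once this reads true, the write to payload made
    // before the matching release store is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Calling `run_handoff()` should return 42. Dropping the explicit `memory_order` arguments gives the default `std::memory_order_seq_cst`, which is also correct here but asks for a single total order across all seq-cst operations, and that is typically where the extra barrier cost comes from.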