Well... I didn't really describe the C++11 memory model above. It was a gross simplification, because I didn't account for how the C++11 memory model acts "relative to a variable". (And this "variable" is typically the mutex itself.)

I don't know much about the Linux Kernel, but I gave the document a brief read-over (https://www.kernel.org/doc/Documentation/memory-barriers.txt).

My understanding is that WRITE_ONCE / READ_ONCE are meant for this "relative to a variable" issue. It's _precisely_ the issue I ignored in my post above.

All C++11 atomics are "relative to a variable". There are typically no memory barriers floating around by themselves (they exist, but you probably don't need free-floating memory barriers to get the job done).

So you wouldn't write a standalone "release_barrier()" in C++11. You'd write "atomic_var.store(value, memory_order_release)", saying that the half-barrier is relative to atomic_var itself.
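To make the "relative to a variable" idea concrete, here is a minimal sketch of a release store paired with an acquire load on the same atomic. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are made up for illustration and are not from any particular codebase.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names: `payload` is ordinary data, `ready` is the atomic
// variable that the half-barriers are attached to.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // earlier writes may not sink below this store
}

int consumer() {
    // Later reads may not be hoisted above this acquire load.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    return payload;  // sees 42, because the acquire load observed the release store
}

// Runs one producer and one consumer; returns what the consumer observed.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    reader.join();
    writer.join();
    assert(seen != -1);  // the consumer must have run to completion
    return seen;
}
```

The ordering guarantee only exists between the thread that did the release store on `ready` and the thread whose acquire load on `ready` saw that value. There is no barrier here that orders "everything" on its own.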

----------

    a();
    b();
    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
    c();
    d();
    e();
    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
    f(); 
    g();

So the C++ acquire/release model is always relative to a variable, commonly the spinlock.

This means that c(), d(), and e() are protected by the spinlock (or whatever synchronization variable you're working with): none of them can move above the acquire or below the release. Moving a() or b() down "inside the lock" is fine (and likewise moving f() or g() up), because they belong to the "unlocked region", and the higher-level programmer is fine with "any order" outside of the locked region.

Note: this means that c(), d(), and e() are free to be rearranged as necessary. For example:

    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
    for(int i=0; i<100; i++){
      value+=i;
    }
    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock

The optimizer is allowed to reorder the loop iterations into:

    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value

    for(int i=99; i>=0; i--){ // decrement-and-test form is faster on many processors
      value+=i;
    }

    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock

It's the ordering "relative" to the spinlock that needs to be kept, not the order of any of the other loads or stores that happen. As long as all value+=i stores are done "before" the atomic_store(spinlock) command, and "after" the atomic_swap(spinlock) command, all reorderings are valid.

So reordering from "value+=0, value+=1, ... value+=99" into "value+=99, value+=98... value+=0" is an allowable optimization.
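For reference, the pseudocode above maps onto real C++11 roughly like this. It's only a sketch: `spinlock`, `shared_value`, `lock`, `unlock`, and `run_workers` are illustrative names, and `std::this_thread::yield()` stands in for the hypothetical `hyperthread_yield()`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> spinlock{0};  // 0 = unlocked, 1 = locked
long shared_value = 0;         // ordinary variable, protected by spinlock

void lock() {
    // exchange() writes 1 and atomically returns the previous value.
    // Acquire half-barrier: nothing after it may move above it.
    while (spinlock.exchange(1, std::memory_order_acquire) != 0)
        std::this_thread::yield();
}

void unlock() {
    // Release half-barrier: nothing before it may move below it.
    spinlock.store(0, std::memory_order_release);
}

// Each worker adds 0 + 1 + ... + 99 to shared_value under the lock.
long run_workers(int num_threads) {
    shared_value = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([] {
            lock();
            // The compiler may run this loop in any order (or fold it into
            // a single add), as long as it stays between lock() and unlock().
            for (int i = 0; i < 100; ++i) shared_value += i;
            unlock();
        });
    }
    for (auto& w : workers) w.join();
    return shared_value;
}
```

Whichever order the additions happen in, every thread contributes 4950, so the total does not depend on how the loop body was rearranged.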

----------

It seems like WRITE_ONCE / READ_ONCE were written with DEC Alpha in mind, which is far weaker (fewer guarantees about order) than even ARM. DEC Alpha was an early popular multiprocessor system, but its memory model allowed a huge number of reorderings.

WRITE_ONCE / READ_ONCE probably compile into plain loads and stores on ARM or x86, with no extra barrier instructions. I'm not 100% sure, but that'd be my guess. I think the last 20 years of CPU design have overall said that the DEC Alpha's reorderings were just too confusing to handle in the general case, so CPU designers / low-level programmers just avoid that situation entirely.
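As a rough illustration of why no barrier instruction is needed, here is a simplified C++ analogue built on a volatile access. The helpers `read_once`, `write_once`, and `demo_once` are hypothetical: the kernel's real macros are C, handle more cases, and this sketch says nothing about Alpha.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical analogue of READ_ONCE: a volatile read that the compiler
// may not merge with other reads, drop, or repeat.
template <typename T>
T read_once(const T& x) {
    static_assert(std::is_scalar<T>::value, "sketch only covers scalar types");
    return *static_cast<const volatile T*>(&x);
}

// Hypothetical analogue of WRITE_ONCE: a volatile write with the same
// constraints on the compiler.
template <typename T>
void write_once(T& x, T value) {
    static_assert(std::is_scalar<T>::value, "sketch only covers scalar types");
    *static_cast<volatile T*>(&x) = value;
}

// Single-threaded usage: the accesses behave like ordinary loads and stores.
int demo_once() {
    int flag = 0;
    write_once(flag, 7);
    return read_once(flag);
}
```

This only constrains the compiler. In portable C++ volatile is not a synchronization primitive, and the closest standard tool is a std::atomic accessed with memory_order_relaxed.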

"Dependent memory accesses" sounds very similar to the confusing language of memory_order_consume, which is, again, a model almost no one understands and almost no C++ compiler implements (compilers generally just treat it as acquire). :-) So we can probably ignore that.