by dragontamer 2404 days ago
A good, introductory, high-level overview of what is going on with cache coherence... albeit specific to x86.

ARM systems are more relaxed, and therefore need more barriers than on x86. Memory barriers (which also function as "compiler barriers" for the memory / register thing discussed in the article) are handled as long as you properly use locks (or other synchronization primitives like semaphores or mutexes).
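To make the "locks handle the barriers for you" point concrete, here is a minimal C++ sketch. The names `total_mutex`, `shared_total`, `add_to_total` and `run_workers` are made up for illustration. Locking the mutex behaves as an acquire and unlocking as a release, so the plain `int` needs no explicit barrier anywhere in the code:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: a plain (non-atomic) int guarded by a mutex.
std::mutex total_mutex;
int shared_total = 0;

// Every increment happens inside a critical section, so it is ordered
// against all other critical sections on the same mutex.
void add_to_total(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<std::mutex> guard(total_mutex);
        ++shared_total;
    }
}

// Starts `workers` threads, waits for them, and returns the final total.
int run_workers(int workers, int times) {
    assert(workers >= 0 && times >= 0);
    {
        std::lock_guard<std::mutex> guard(total_mutex);
        shared_total = 0;
    }
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(add_to_total, times);
    for (std::thread& t : pool) t.join();
    std::lock_guard<std::mutex> guard(total_mutex);
    return shared_total;
}
```

Calling `run_workers(4, 10000)` should always return 40000, on x86 and on ARM alike, because the mutex supplies whatever barriers the target needs.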

It's good to know how things work "under the covers", at least for performance reasons. Especially if you ever write a lock-free data structure (not allowed to use... well... mutexes or locks), because then you need to place the barriers in the appropriate spots yourself.
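As a rough sketch of what "placing the barriers yourself" looks like in C++11, here is the classic message-passing pattern. The names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are hypothetical. The release store publishes the plain write before it, and the acquire load that observes `true` is guaranteed to see that write:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain data, written before publishing
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // nothing above may sink below this
}

int consumer() {
    // Nothing below may be hoisted above the acquire load that sees `true`.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    assert(seen == 42);  // guaranteed by the release/acquire pairing
    return seen;
}
```

If both operations on `ready` were `memory_order_relaxed` instead, the language would no longer guarantee that the consumer sees 42, even if it happens to work on x86.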

------

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release... ARM and POWER have added acquire/release model instructions, and even CUDA has acquire/release semantics being built.

As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.


I don't see anything x86-specific in the article. It focuses on cache coherency, which is applicable to ARM and POWER, and there isn't much about the memory model.

Even the description of MOESI is just an introduction and, as the article mentions, actual systems use more complicated protocols.

Edit: if anything, the misconception is that memory barriers have anything to do with cache coherency.

> ARM and POWER have added acquire/release model instructions

They have implemented the acquire-release consistency model since day one (or, the day they started supporting multi-processors). Yes, there are some subtleties there that have in some cases been tightened later on, e.g. multi-copy atomicity.

IIRC, there was a big stink because ARM and POWER historically implemented consume/release semantics, which is very slightly more relaxed than the now de-facto standard acquire/release semantics.

ARM and POWERPC CPU devs worked very hard to get consume/release into C++11, but no compiler writer actually implemented that part of the standard. As such, consume/release can safely be consigned to the annals of computer history (much like DEC Alpha's fully relaxed semantics).

Then in ARMv8, ARM simply added LDAR (load-acquire) and STLR (store-release) instructions. https://developer.arm.com/docs/100941/0100/barriers . So the ARM CPU now fully supports the acquire/release model. Apparently IBM's POWER instruction set was similarly strengthened to acquire/release (either POWER8 or POWER9).

ARM / POWER "normal" loads and stores still have consume/release semantics. But compilers can simply emit LDAR (load-acquire) for the stronger guarantee.
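A small sketch of how that looks from the C++ side. The names `slot`, `store_release`, `load_acquire` and `load_relaxed` are made up here, and the instruction names in the comments are what I'd expect GCC or Clang to emit for a plain ARMv8 target; the exact output depends on the compiler and target flags, so treat them as an assumption rather than a promise:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<std::uint64_t> slot{0};

// Expected on AArch64: STLR (store-release).
void store_release(std::uint64_t v) {
    slot.store(v, std::memory_order_release);
}

// Expected on AArch64: LDAR (load-acquire), or a close variant on newer cores.
std::uint64_t load_acquire() {
    return slot.load(std::memory_order_acquire);
}

// Expected on AArch64: a plain LDR, with no ordering beyond atomicity.
std::uint64_t load_relaxed() {
    return slot.load(std::memory_order_relaxed);
}
```

On x86 all three would normally compile to an ordinary MOV, which is the "stronger hardware guarantees" point in a nutshell.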

----------

I remember at least one talk that showed that consume/release is ideal for things like Linux's RCU or something like that (that acquire/release is actually "too strong" for RCU, and therefore suboptimal). But because compiler-writers found consume/release too hard to reason about in practice, we're stuck with acquire/release.

It seems like the C++ standard continues to evolve to push for memory_order_consume (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p075...), but all the details are still up for discussion.

> ARM and POWERPC CPU devs worked very hard to get consume/release into C++11

AFAIR Paul McKenney was the primus motor, and the motivation was largely RCU. Then again, McKenney also worked for IBM at the time and certainly had an interest in pushing a model that mapped well to POWER.

But it turned out to be both somewhat mis-specified and hard to implement cleanly, so most compilers just implemented it as an acquire.
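For anyone who hasn't seen it, this is roughly the RCU-style pointer publication that consume was meant for. It's a sketch with made-up names (`Config`, `current`, `publish`, `read_value`, `run_publish_demo`). The reader only needs ordering for accesses that depend on the loaded pointer, which is what `memory_order_consume` asks for; as noted above, compilers generally just treat it as an acquire:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Config { int value; };           // hypothetical payload type

std::atomic<Config*> current{nullptr};  // pointer that readers follow

// Writer: fully initialise the object, then publish the pointer.
void publish(Config* fresh) {
    current.store(fresh, std::memory_order_release);
}

// Reader: the dereference is data-dependent on the loaded pointer,
// which is the only ordering that consume is supposed to guarantee.
int read_value() {
    Config* seen = nullptr;
    while ((seen = current.load(std::memory_order_consume)) == nullptr) {
        std::this_thread::yield();
    }
    return seen->value;
}

// Publishes one Config and returns the value a concurrent reader observed.
int run_publish_demo() {
    current.store(nullptr, std::memory_order_relaxed);
    Config fresh{7};  // initialised before it is published
    int seen = -1;
    std::thread reader([&seen] { seen = read_value(); });
    publish(&fresh);
    reader.join();    // reader is done before `fresh` goes out of scope
    assert(seen == 7);
    return seen;
}
```

Swapping `memory_order_consume` for `memory_order_acquire` gives the same result; the difference would only show up as an extra barrier or a stronger load instruction on ARM or POWER.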

As you mention, there is ongoing work to fix it.

As for ARM, it seems the big thing they've done since the initial release of ARMv8 is to banish non-multicopy atomicity. See https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pd...

> As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.

Linux relies heavily on performant RCU for scalability, which a pure acquire/release SW programming model can't support.

There must be some kind of communication error going on. I don't know much about RCU, so I just pulled up this webpage:

https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

In it is:

> The rcu_read_lock() and rcu_read_unlock() primitive read-acquire and release a global reader-writer lock.

Seems like RCU-operations in the Linux kernel are defined in acquire-barrier and release-barrier terms. I heard a while ago that RCU could be discussed in terms of release-consume semantics (which are slightly faster but harder to understand...) but very few people understand release-consume.

As such, release-acquire is probably the memory model of the future. I'm not really aware of anything aside from: Fully Relaxed (unordered), the obscure release-consume, release-acquire, and finally sequentially consistent (too slow for modern systems)

---------

Are you perhaps confusing "acquire-release" semantics (which is a memory-barrier / cache coherence principle) with spinlocks? Acquire-release seems to be the "fastest practical" memory consistency model, since relaxed doesn't work and release-consume is too confusing.

For more info on acquire-release, Preshing's blogposts are great: https://preshing.com/20130922/acquire-and-release-fences/

> I'm not really aware of anything aside from: Fully Relaxed (unordered), the obscure release-consume, release-acquire, and finally sequentially consistent (too slow for modern systems)

What about Total Store Ordering (TSO), which is what e.g. the obscure and rare x86(-64) architecture implements (and SPARC as well)?

That is a good point: the x86 model is "stronger" than acquire-release. Which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements "stronger guarantees" at the hardware level.

Well, that is until you enable gcc -O3 optimizations, which will move memory accesses around, merge variables together, and apply other such optimizations that follow the acquire-release model instead of TSO. Remember that the compiler has to consider the memory-consistency model between registers and RAM (when are registers holding "stale" data and need to be re-read from RAM?)
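A minimal sketch of the "stale register" problem, with hypothetical names (`stop_requested`, `spin_until_stopped`, `run_stop_demo`). If `stop_requested` were a plain `bool`, the loop would be a data race and an optimizing compiler would be free to read it once into a register and spin forever. Making it a `std::atomic`, even with relaxed ordering, forces a fresh load on every iteration:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> stop_requested{false};

// Spins until another thread sets the flag; returns the iteration count.
// The atomic load cannot be hoisted out of the loop by the compiler.
long spin_until_stopped() {
    long iterations = 0;
    while (!stop_requested.load(std::memory_order_relaxed)) {
        ++iterations;
    }
    return iterations;
}

// Starts a spinning worker, asks it to stop, and reports whether it returned.
bool run_stop_demo() {
    stop_requested.store(false, std::memory_order_relaxed);
    long iterations = -1;
    std::thread worker([&iterations] { iterations = spin_until_stopped(); });
    stop_requested.store(true, std::memory_order_relaxed);
    worker.join();
    return iterations >= 0;
}
```

Relaxed is enough here because the flag carries no other data with it; the moment the worker has to read something the main thread wrote before setting the flag, you are back to needing release/acquire.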

-------

The thing is, acquire-release is becoming far more popular and is the gold standard that C++11 has more or less settled upon. C++11, ARM, POWER9, CUDA, and OpenCL have all moved to acquire-release semantics for their memory models.

Next generation PCIe 5.0, CXL, OpenCAPI, are all looking at extending cache-coherence out to I/O devices such as NVMe flash and GPUs / Coprocessors. I'm betting that Acquire/release will become more popular in the coming years. TSO is too "strict" in practice, people actually want their reads-and-writes to "float" out of order with each other in most cases, especially when you're talking about a PCIe-pipe that takes 5-microseconds (20,000 clock-ticks!!) to communicate over.

> That is a good point: the x86 model is "stronger" than acquire-release. Which is probably why it took so long for acquire-release to become beneficial. Any x86 coder who codes in acquire-release will not see any performance benefit on x86, because x86 implements "stronger guarantees" at the hardware level.

Yes, in a way it's a race to the bottom; code that works on TSO hw works on acquire-release hw, but not the other way around. There are only two ways to combat this race: education, and using concurrency libraries written by people who know what they're doing.

> acquire-release is becoming far more popular and is the gold standard that C++11 has more or less settled upon

Hmm, how come? C++11 supports many different models, relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.
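A short sketch of how those choices show up in code, using made-up names (`hits`, `count_relaxed`, `set_default`, `set_explicit`, `run_counter`). Omitting the order argument means sequential consistency, and `memory_order_relaxed` still gives atomicity, just no ordering with respect to other memory:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> hits{0};

// Relaxed: each increment is atomic, but orders nothing else around it.
void count_relaxed(int times) {
    for (int i = 0; i < times; ++i) {
        hits.fetch_add(1, std::memory_order_relaxed);
    }
}

// These two stores mean the same thing: seq_cst is the default order.
void set_default(std::atomic<int>& a, int v)  { a.store(v); }
void set_explicit(std::atomic<int>& a, int v) { a.store(v, std::memory_order_seq_cst); }

// Counts from several threads and returns the total.
int run_counter(int workers, int times) {
    set_default(hits, 0);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(count_relaxed, times);
    for (std::thread& t : pool) t.join();
    return hits.load(std::memory_order_acquire);
}
```

The counter is exact even with relaxed increments, which is why relaxed is fine for statistics but not for publishing data to another thread.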

To some extent I think that's a failing of the C++11 model. Instead of choosing one (sane) model, they made people choose between an array of models with subtle semantics. Choosing a single model is what the recent formal Linux kernel memory model did, although that's not ideal either, given the requirement to not be too different from the previous informal description and the boatloads of legacy code. See http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf

In general, it seems to me that progress is being made in formal memory models, and I hope that in a few years' time there will be some kind of synthesis giving us a model that is both reasonably easy to implement in hw with good performance, easy enough to reason about, as well as formally provable. We'll see.

> Hmm, how come? C++11 supports many different models, relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.

Well, nothing will ever be "officially" blessed as the one true model. As the saying goes: we programmers are like cats, we all will be moving off in our own direction, doing our own thing.

Overall, I just think that "programmer culture" is beginning to settle down on acquire-release semantics. It's just a hunch... but more and more languages (C++, CUDA) and systems (ARM, POWER, NVidia GPUs, AMD GPUs) seem to be moving towards acquire-release.

And in the next few years, we'll have cache-coherency over PCIe 4.0 or PCIe 5.0 in some form (CXL or other protocols on top of it). A unified memory model across CPU, DDR4 RAM, the PCIe-bus, and co-processors (GPUs, FPGAs, or Tensor cores), and high-speed storage (Optane and Flash SSDs over NVMe) is needed.

The community is just a few years out from having a unified memory model + coherent caches across the I/O fabric. Once this "de facto standard" is written, it will be very hard for it to change. That's why I think acquire-release is here to stay for the long term. It's the leading memory model right now.

Keep in mind that even when the underlying hardware implements TSO, what programming languages expose is basically the release/acquire model of memory semantics.

This means that as a programmer, you still have to code against the release/acquire model because the compiler may reorder your memory accesses. Having TSO in hardware is still helpful though, because it means the compiler has to emit fewer explicit barrier instructions at the end. That is, the barriers that you do have in your original code end up being a little bit cheaper (at the cost of having an overall more complex hardware architecture).

In TSO every store is a release and every load is an acquire, so it maps very efficiently to the acquire/release model.

Sure, since it's a stronger model, code which works on weaker acquire/release hw will work on TSO hw as well. You might as well say that sequential consistency maps very efficiently to an acquire/release model too in the same way.

Isn't TSO the closest practical implementation to an acquire/release model? What are the practical differences?

I know that TSO makes it easier to recover sequential consistency with additional barriers (Intel strengthened their original memory model to TSO for this reason).
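The one reordering that TSO still allows is a store followed by a load from a different address, which is exactly what the store-buffering litmus test pokes at. Below is a sketch with made-up names (`x`, `y`, `both_read_zero`, `count_forbidden_outcomes`). With `memory_order_seq_cst` the outcome "both threads read 0" is forbidden; with only release stores and acquire loads it would be allowed, on x86 as well as on ARM:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
std::atomic<int> y{0};

// One store-buffering trial: each thread stores to its own variable and
// then loads the other one. Returns true if both loads saw 0.
bool both_read_zero() {
    x.store(0);
    y.store(0);
    int r1 = -1;
    int r2 = -1;
    std::thread a([&r1] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread b([&r2] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the trial and counts how often the forbidden outcome appeared.
int count_forbidden_outcomes(int trials) {
    int forbidden = 0;
    for (int i = 0; i < trials; ++i) {
        if (both_read_zero()) ++forbidden;
    }
    return forbidden;
}
```

With seq_cst the count stays at 0 no matter how many trials you run. As far as I know, on x86 the extra cost lands on the store side (typically an XCHG, or a MOV followed by MFENCE), which is the "additional barrier" needed to get from TSO back to sequential consistency.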