Hacker News new | ask | show | jobs
by amluto 70 days ago
> On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

I think that, in common cache coherence protocols, this is kind of straightforward -- the unlock is a store-release, and as long as the cache line ends up being written locally, the hardware or ucode or whatever simply [1] needs to check whether a needs-notification flag is set in the same cacheline. Or the futex-wait operation needs to do a super-heavyweight barrier to synchronize with the releasing thread even though the releasing thread does not otherwise have any barrier that would do the job.

One nasty approach that might work is to use something like membarrier, but I'm guessing that membarrier is so outrageously expensive that this would be a huge performance loss.

But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.

I do expect that it would be nearly impossible to convince an x86 CPU vendor to commit to an answer either way.

(Do other architectures, e.g. the most recent ARM variants, have an RMW release operation that naturally does this? I've tried, and entirely failed AFAICT, to convince x86 HW designers to add lighter weight atomics.)

[0] Visible to the remote thread, but the kernel can easily mediate this, effectively for free.

[1] Famous last words. At least in ossified microarchitectures, nothing is simple.

2 comments

> > On x86 a spinlock release doesn't need a memory barrier (unless you do insane things) / lock prefix, but a futex based lock does (because you otherwise may not realize you need to futex wake).

> Now you've gotten me wondering. This issue is, in some sense, artificial: the actual conceptual futex unlock operation does not require sequential consistency. What's needed is (roughly, anyway) an release operation that synchronizes with whoever subsequently acquires the lock (on x86, any non-WC store is sufficient) along with a promise that the kernel will get notified eventually (and preferably fairly quickly) if there was a non-spinning sleeper. But there is no requirement that the notification occur in any particular order wrt anything else except that the unlock must be visible by the time the notification occurs [0]; there isn't even a requirement that the notification not occur if there is no futex waiter.

Hah.

> ... > But maybe there are sneaky tricks. I'm wondering whether CMPXCHG (no lock) is secretly good enough for this. Imagine a lock word where bit 0 set means locked and bit 1 set means that there is a waiter. The wait operation observes (via plain MOV?) that bit 0 is set and then sets bit 1 (let's say this is done with LOCK CMPXCHG for simplicity) and then calls futex_wait(), so it thinks the lock word has the value 3. The unlock operation does plain CMPXCHG to release the lock. The failure case would be that it reports success while changing the value from 1 to 0. I don't know whether this can happen on Intel or AMD architectures.

I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line. Due to being a somewhat contended simple spinlock, often embedded on the same line as the to-be-protected data, it's common for the line not not be in modified ownership anymore at release.

> I suspect the problem isn't so much the lock prefix, but that the non-futex spinlock release just is a store, whereas a futex release has to be a RMW operation.

> I'm talking out of my ass here, but my guess is that the reason for the performance gain of the plain-store-is-a-spinlock-release on x86 comes from being able to do the release via the store buffer, without having to wait for exclusive ownership of the cache line.

I don’t think so. The CPU is pretty good about hiding that kind of latency — reading a contended cache line and doing a correctly predicted branch shouldn’t stall anything after it.

But LOCK and MFENCE are quite expensive.

Using LOCK CMPXCHG or even plain CMPXCHG does not make sense unless it is done in a loop, which tests the success of the operation.

Implementing locks does not need this kind of loops, which may greatly increase the overhead, but only loops that do simple loads, for detecting changes, or the invocation of a FUTEX_WAIT, which is equivalent with that.

Besides loops that wait for changes, any kind of lock may be implemented with atomic read-modify-write instructions (e.g. on x86 XCHG, LOCK XADD, LOCK BTS and so on, and equivalent instructions on Armv8.1-A or later ISAs) that are not used in loops, so they have predictable overhead. For example, a futex may be used by a thread that waits for multiple events, if the other threads use a locked bit-test-and-set on the futex variable to signal the occurrence of an event, where each event is assigned to a distinct bit.

CMPXCHG and the equivalent load-and-lock/store conditional are really needed far less often than some people use them. The culprit is a widely-quoted research paper that has shown that these instructions are more universal than simple atomic fetch-and-operation instructions, allowing the implementation of lock-free algorithms, but the fact that they can do more does not mean that they should also be used when their extra power is not necessary, because that is paid dearly by introducing non-deterministic overhead.

A simple atomic instruction has an overhead much greater than an access to the L1 cache or the L2 cache, but typically the overhead is similar to that of a simple access to the L3 cache and significantly lower than the overhead of a simple access to the main memory, which remains the most expensive operation in modern CPUs.

Moreover, while mutual exclusion can be implemented reasonably efficiently with locks, it is also used far more often than necessary. It is possible to implement shared buffers or message queues that use neither mutual exclusion nor optimistic access that may need to be retried (a.k.a. lock-free access), but instead of those they use dynamic partitioning of the shared resource, allowing concurrent accesses without interference.

Organizing the cooperation between threads around shared buffers/message queues is frequently much better than using mutual exclusion, which stalls all contending threads, serializing their execution, and also much better than lock-free access, which may need an unpredictable number of retries when contention is high.

You are misunderstanding me, which is perhaps understandable, since I’m talking about the minutiae of x86, not locking in general.

When unlocking a futex-backed mutex, one needs to do two things. First, one needs to actually unlock it: this is a store-release in modern lingo, and on x86 almost any store instruction has the correct ordering semantics. Second, one needs to determine whether to call futex_wake, which is conceptually just reading a flag “is someone waiting” and then branching on the result. The problem is that the load needs to be ordered after (or at least not before) the store.

x86 provides two main ways to do this, MFENCE and LOCK. For whatever reason, at least Intel has tried pretty hard to optimize LOCK, and it’s often the case that LOCKed operations on a hot cache line is faster than MFENCE. (I have benchmarked this, and Linux uses this trick.)

My point is that the specific algorithm of unlocking a futex-backed mutex does not require the full ordering semantics of MFENCE or LOCK. And my secondary observation is that x86 has some non-LOCKed RMW instructions, one of which is plain CMPXCHG. Unlocked CMPXCHG is much faster than LOCK anything or MFENCE — I’ve benchmarked it. There are also the flag outputs from operations like ADD. And I’m speculating that maybe some of these instructions are secretly actually ordered strongly enough for futex unlock.