Hacker News new | ask | show | jobs
by jnordwick 2363 days ago
This comes around every so often, and it isn't very interesting in that the best mutexes basically spins 1 or a couple times then falls back to a lock. It isn't a true pure spinlock vs pure lock (mutex/futex) fight.

I think the linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

His benchmark is weird, but maybe I'm reading it wrong:

* Instead of reducing thread count to reduce contention he appears to increase the number of locks available. This is still a bad scenario for spinlocks since they will still have bad interactions with scheduler (they will use a large amount of cpu time when and get evicted from the run queue and need to be rescheduled).

* Also, I didn't see him pin any of the threads, so all those threads will start sharing some cpus since the OS does need to be doing some work on a them too.

* And Rust can be a little hard to read, but it seem he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of the false sharing issues. He's getting all the cache coherence traffic still because of it.

The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.

What they call the AMD spinlock (basically just a regular sane spinlock that tries to prevent cache thrashing) has its best performace with a low number of threads, assigned to different cores under the same L3 segment.

(Does anybody know if AMD's new microarchitecture went away from the MOESI/directory based cache towards Intel's MESIF/snoop model?)

The MOESI model might have performed better in this regard under worse case scenario since it doesn't need to write the cache line back and can just forward around the dirty line as it keeps track of who owns it.

And if you run under an MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.

A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.

(edit: answered my own question, apparently the vdso is still pretty limited in what it exports, so no. it isn't happening at this time from what i can tell.)

3 comments

The locks should be on separate cachelines, that's what the CachePadded::new is for.

futex cannot be implemented in the VDSO since it needs to call into the scheduler.

Another way to think about this: VDSO is used for syscalls that are (mostly) read-only and can avoid the kernel mode switch on a common fast path. The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.

> The locks should be on separate cachelines, that's what the CachePadded::new is for

I see that now. I looking for it in the AmdSpinlock struct, but that kind of makes sense.

> The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.

Ah. Thanks. I didn't know how far you could get with MWAIT, but I guess you still need to deschedule. I also didn't realize futex was a direct syscall and there was no user level api going on around it.

Is he running 32 threads even in the low contention case? And not pinning? There's something about his numbers that just seem a little too high for what I would expect. I've seen this around a lot, and the reason the mutex usually wins is that is basically does a spin of 1 or more then goes into a mutex (the pthread code appers to spin 100 times before falling back to a futex).

At work I use a spin lock on a shared memory region because it does test out to be lower latency than std::mutex and we're not under much contention. I've though about replacing it with a light futex-based library, but doesn't seem to be quicker.

He still seems to be getting some contention, and I'm trying figure out how.

> linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worse of the sycall costs.

The semantic of a futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".

The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.

So a process first tries to acquire the mutex as for a normal spin lock (with a CAS, exchange, xadd or whatever work for the specific algorithm) attempting to set the lock bit to 1; if it fails (possibly after spin-trying a few times), it invokes the futex wait. As you can see the syscall is only done in the slow path.

On a futex release, the simple solution is to always invoke futex signal syscall after setting the lock bit to zero[1]. To fast path the relase, a wait bit can be set on the acquire path just before calling futex wait, so on the release path, when setting the lock bit to zero, signal would only be called if the waiter bit was set (and the cleared by together with the lock bit)[2].

As you can see the futex syscall is already the slow path and never need to be onvoked in the uncontented case. In fact the kernel doesn't even need to know about the futex untill the first contention.

[1] futexes are edge triggered, same as condition variables, so a signal will only wakes any thread blocked ona wait that happened-before the signal call. Thus there is a race condition if a lock rrlease and signal happens between the failed acquire attempt and the wait call; to prevent this futex wait as an additional parameter that is the expected value of the memory location: the kernel checks the futex address against this valur and will only if it hasn't changed will put the thread to sleep (this is done atomically).

[2] as there could me multiple waiters, a broadcast is actually needed here which leads to a thundering herd as all waiters will race to acquire the lock. The original futex paper used a wait count instead of just a bit but there are other options.

According to "man vdso", futex is not in vDSO on any architecture. What eliminates the syscall cost is that it's rarely called due to elision.

A mutex may try to spin more than a "couple of times" before it calls futex. Example: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_...

> is that it's rarely called due to elision.

Calling this 'elision' is a bit confusing, since the glibc developers use the term 'lock elision' to refer to hardware lock elision via Intel TSX extensions.

What eliminates the syscall cost is optimistic spinning in userspace, so that locking only does futex_wait in cases of contention when the lock isn't released 'soon' after the thread starts trying to acquire it.