Hacker News new | ask | show | jobs
by hackworks 2356 days ago
Not an expert here.

In a spin lock, the lock state is checked in a tight loop by all waiters. This will be using some sort of memory fence. FWIK, memory fence or barriers flush the CPU cache and would initiate reading the variable (spin lock state) for evaluation. I would expect spin locking overheads to increase with number of cores.

On NUMA, I think flushing is more expensive. Hence, spin locks have an additional overhead of having to load and evaluate on every spin as against being woken up for mutexes (like a callback)

3 comments

Yes, spin locks require atomic memory operations - as do many other OS primitives. CPUs have dedicated instructions to do this more quickly for locks than generalized coherent memory, for example cmpxchg on x86. There are microarchitectural optimizations to make these things work well.

I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.

Spin locks are very useful for locks held for short durations, and only on MP systems, particularly in the kernel, since a spin lock there can deadlock the whole system on a single CPU machine. The very general rule of thumb is that you generally want blocking locks, unless there is a very specific reason to use a spin lock. Those very specific reasons might be waiting briefly on some device in a driver, or running a realtime system where wake latency is critical, etc. Most of the time, spin locks will burn CPU for no good reason, and you still have to implement the non-spinlock case anyway, in case you are running on a 1 CPU system. So, might as well start with blocking locks, and only add spin locks where performance would benefit from it.

It doesn’t flush the entire cache (that would be a disaster) but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

The real issue with spin locks is fairness. There’s no assurance that any given thread will ever make progress. A thread could starve forever. Production-ready mutexes like absl::Mutex make efforts toward fairness, even if they don’t have hard guarantees.

> but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

Well, as long as you do the test-CAS instead of the pure-CAS approach not every loop iteration results in cache line bouncing.

Plus intel has introduced the MWAIT[0] instruction to implement something similar to futex in hardware, i.e. the hyperthread can sleep until another core updates the cacheline in question.

[0] https://www.felixcloutier.com/x86/mwait

That's true, though MWAIT can only be used from kernel mode. At least the docs say that you get a #UD exception if you attempt to use it from user mode.
Right, there's UMWAIT[0] for that, but that's quite new.

[0] https://www.felixcloutier.com/x86/umwait

Aren't most modern spinlock implementations are ticket-based, which ensures fairness among waiters (FIFO-like)? The linux kernel implementations most definitely are.

I agree that the naive implementations are not.

Completely fair locks can have extremely terrible throughput performance, so depending what you are doing with them, they might not be a good idea...
> checked in a tight loop by all waiters

This actually does not have to be this way. You could have a linked list of spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock. When the previous waiter is done it unlocks the next spinlock, and so on. The implementation gets a bit complicated on non-GC languages, since there are races between insertion/removal on the linked list. If the number of threads is fixed and long-lived then it becomes easier, since instead of a linked list you can have an array of spinlocks.

Note: in some architectures (x86?) you could possibly do away with atomics, since (I believe) int updates are atomic by default. Not really sure though.

Yep, that's the underlying principle behind Mellor-Crummey Spinlock: https://lwn.net/Articles/590243/.
Yes! The MCS spinlock, the name eluded me. The paper is actually a pretty good read (it is linked in the lwn article).
> you could possibly do away atomics, since (I believe) int updates are atomic by default.

The release can be a compiler only barrier followed by a simple store, but you do need an atomic RMW in the acquire path.

It is technically possible to implement a lock with just loads and stores (see Peterson lock[1]) even in the acquire path but it does require a #StoreLoad memory barrier even on Intel, which is as expensive as an atomic RMW so it is not worth it.

Edit:fixed barrier name typo.

[1] https://en.m.wikipedia.org/wiki/Peterson%27s_algorithm

> #StoreRelease memory barrier even on Intel

What is that? x86 is TSO...

Do you have an example of the full acquire and release sections in x86 assembly?

I'm sorry, I meant #StoreLoad.

Edit: here is a good (but long) article with a x86 Peterson Lock example (with an analysis of the critical race without the barrier):

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fe...

NetApp has a very interesting implementation RW spin lock inside the kernel. I tried optimizing to make it more fair for writers by introducing a single variable that would be check in each iteration.

The additional delay it added for checking the state resulted in deadlock timer triggering a dump! Hence, checking a variable is very expensive.

If you're interested in very efficient MCS style reader writer locks, check out this paper:

https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.p...