Hacker News new | ask | show | jobs
by twic 2363 days ago
The author has an implicit definition of "faster" which it is important to be aware of.

The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

That is, if you have a worker task which is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.

For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preeempted. You will burn a huge amount of CPU doing this, and so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.

5 comments

>The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

The main use of spinlocks that I'm aware of is dealing with interrupt handlers in driver code. In this situation you generally can't go to sleep (sleeping with interrupts disabled is generally a good way of never waking up) so calling "mutex_lock" is simply out of the question. That's probably a niche use but that's literally the only situation I've ever used actual spin locks instead of mutexes, mainly for the reasons outlined by TFA.

I did some algorithmic/ high frequency trading and we were only spinning waiting for task and never, ever releasing to OS. But then the entire server was dedicated to one job and the job consisted of reacting to single stream of messages.

Most of the time I would say if lock performance impacts your application you are likely doing something wrong. There are many ways of solving typical synchronization problems without locks and if you need locks there are many ways to amortize costs.

Mutexes are not faster than spin locks the same way slower cars are not faster than fast cars. You might still crash in a fast car and be slower to the end of the race (to the supermarket) but that is just your failure to use the power you were given responsibly.

TFA? Tried to google but unsuccessful
I've seen lots of valid variants, including and probably not limited to:

The feature article ; the freaking article ; the friendly article ; the f$%^&*( article, c.f. RTFM.

I personally go with "The Fine Article" in my mind when I write it, although obviously all these alternatives are possible. I first encountered this initialism on slashdot back when it was still relevant, although I don't know if it was coined there.

And yeah, I believe that it derives from "RTFA" (read the fucking article) which would be what you told people who obviously commented without reading the article.

Not sure it’s the origin, but it used to be RTFM (read the fucking manual).
RTFM has been mostly superseded by LMGTFY.
Also read "read the fine manual"
The original, as documented in the hacker dictionary, is of course RTFM (read the f...ine manual).
The (Fuc|Frea)king Article. In a similar spirit to RTFM.
The aForementioned Article

the F is usually something else though ...

My OS professor in college told us "RTFM means read the manual; the 'f' is silent"
It’s the same F as in BFG. The gun, not the giant.
The Fine Article.
The (forementioned) article, I would imagine.
The Fabulous Article.
TFA is Slashdot jargon for "The F'ing Article".
TFA on Slashdot... Takes me back to the good old days of 2007...
And this only makes sense in a multi-core system (where the spinlock is held on another core). In a single core system if you are trying to take a contended lock in an interrupt you are stuck and can only fail to take it and hope to handle that gracefully.
And even that is only true for systems that don’t use interrupt threads.
Ya’ll should consider using atomic increment on separate cache lines instead of spinlocks. If you want to minimize latency to the bare minimum, atomic increment gives you two orders of magnitude measurable improvements over locks. https://lmax-exchange.github.io/disruptor/files/Disruptor-1....
They solve different problems. A lock-free datastructure can be much fast than one with locks, but they depend on exactly what operations you can do fast and atomically (for example, the LMAX queue length is limited to a power of 2 because there's no way to atomically increment an integer which will wrap at arbitrary values, and the modulo operator is too expensive for non-power of two values). A lock is a primitive which allows you to make arbitrary operations 'atomic' (though in these benchmarks generally a simple increment is used). A lock-free queue may be a good replacement for a queue using locks, but this is only in one application of locks.

Also, the disruptor design is based on a system where there is a 1:1 correspondance between threads and cores. If this is not the case it will likely still interact somewhat poorly with the OS's scheduler without any blocking at all, for the same reason as spinlocks (in fact, if you use the library you can customize this behaviour).

As far as I'm aware that's exactly how you would implement a spinlock.

Random Google search seems to validate that.

https://stackoverflow.com/questions/1383363/is-my-spin-lock-...

Compare-and-swap isn't quite the same as atomic increment. An atomic increment can't fail; it always increments. Whereas CAS will fail if a different thread has modified the value.

The highest performance design is to use a ringbuffer. To write to the ringbuffer, you atomic-increment the "claim" counter, giving you an index. Now take index modulo ringbuffer size. That's the slot to write to. After you're done writing to it, set your "publish" counter to the index.

Each writer has a "publish" counter, and the "claim" counter isn't allowed to move beyond the lowest "publish" counter modulo ringbuffer size.

Each reader uses a similar strategy: the reader has a current counter. You find the minimum of all publish counter values, then process each slot up to that value, and set your counter to that value. The "claim" counter isn't allowed to move past the minimum of all reader counters.

Hence, everyone is reading from / writing to separate cache lines, and there are no locks at all. The sole primitive is atomic increment, which can never fail. (The only failure condition is "one of the slots hasn't been processed yet" (i.e. the publish counter is <= claim counter minus ringbuffer size) at which point everybody waits for the slot to be processed.

You can wait using multiple strategies: a spin loop (which isn't the same as a spinlock because you're only reading a value in memory, not calling any atomic primitives), yielding CPU time via sched_yield(), or calling sleep. All strategies have tradeoffs. Calling sleep is the least CPU intensive, but highest latency. Vice-versa for spin loop.

Takeaway: there are no locks. Just increments.

From the architecture PoV it is exactly same thing if you think of it as implemented by LL/SC pair. This is how this is presented in most of literature and university courses. Then there is the purely practical issue of x86 not exposing LL/SC and instead having somewhat strict memory model and various lock-prefixed high-level instructions with wildly varying performance characteristics.
Is it?

I'd be interested to see benchmarks showing that spinlocks with CAS have similar throughput to a ringbuffer with atomic increment.

Note that with the ringbuffer approach, each reader can process many slots at once, since you're taking the minimum of all published slot numbers. If you last processed slot 3, and the minimum publish count is 9, you can process slot 4 through 9 without doing any atomic operations at all. The design guarantees safety with no extra work.

It's not a minor trick; it's one of the main reasons throughput is orders of magnitude higher.

Benchmarks: https://github.com/LMAX-Exchange/disruptor/wiki/Performance-...

Beyond that, the ringbuffer approach also solves the problem you'll always run into: if you use a queue, you have to decide the max size of the queue. It's very hard to use all available resources without allocating too much and swapping to disk, which murders performance. Real systems with queues have memory usage patterns that tend to fluctuate by tens of GB, whereas a fixed-size ringbuffer avoids the problem.

This is also the reason DPDK uses a highly-efficient ring buffer. I believe (old benchmarks) it's faster than LMAX.
I do not think x86 atomics are implemented as LL/SC internally. As a minimum they always guarantee forward progress: as soon the cacheline is acquired in exclusive mode (and the coherency protocol gurantees it happens in finite time), the load-op-write always happens and cannot be interrupted.

Also as far as I'm aware, at least on intel all atomic operations take pretty much exactly the same number of clock cyles (except for CAS and DCAS which are ~10% and 50% more expensive, IIRC)

That is exactly my point. Any x86 SMP platform since Pentium is built on the assumption that truly exclusive access is possible. For the shared FSB platforms that is trivially implemented by global LOCK# and K8/QPI simply has to somehow simulate same behavior on top of some switched NUMA fabric (and this is one of the reasons why x86 NUMA coherency protocols are somewhat wasteful, think global broadcasts, and incredibly complex).

For context: before Pentium with its glueless 2x2 SMP/redundancy support there were various approaches to shared memory x86 multiprocessors with wildly different memory coherence models. (And some of the “lets design a board with eight 80386” are the reason why Intel had designed i586 to be glueless and such systems are probably still used to this day, althought unsupported)

Also if both threads are pinned to separate cores and nothing else is supposed to run on those cores, it is pointless to use anything but spinlocks as there is no other thread that could better use the core (and probably you do not want the core to go to a low power syate waiting for an interrupt).
You're discounting energy use. This is a bad strategy on a battery powered device.
In addition to energy, power use is another reason; parking a core will allow it to cool down thermally, so that when it is put back in use (milli)seconds later, it can run at a higher clock speed for longer.
Intel has a low-power PAUSE instruction that is literally a ‘rep nop’. I assume Arm has one too.
That's not extremely low power compared to real low power states. The main advantage of PAUSE is the scheduling of the other hyperthread (if it exists) and maybe not generating a gratuitous L1 / MESI workload at a crazy rate (well if programmed correctly that should be quite cheap in lots of cases, but still...). To my knowledge this does not cut any clock, so the power economy is going to be minimal.
IIRC the mov imm, %ecx; rep nop sequences are somewhat special cased by modern architectures (and this fact is the only reason why you even would want to execute such code). On the other hand the energy savings are mostly negligible and it is simply an SMT-level equivalent of sched_yield()
Actually I heard that the last few generations of intel (from skylake) enter power state mode more aggressively with pause and the latency of getting out of a pause went up from tens of cycles to hundreds. No first hand testing though.
Yes, you wouldn't use this strategy on a battery powered device. It ia for very specialised applications.
> and nothing else is supposed to run on those cores

That's quite the corner case.

This is exactly the situation for a well-balanced parallel work queue. You want to start as many threads as there are cores and run them full tilt pulling work off the queue until it is empty. If you're running a large scale cluster that is dedicated to a particular task (e.g. like servicing a special kind of query, or encoding videos, rendering, etc), this is very common, or even a parallel Photoshop filter.
> This is exactly the situation for a well-balanced parallel work queue.

What if your work queues are running on a multitasking operating system that runs services? And what about a hypervisor?

For this technique you generally dedicate some core(s) to those miscellaneous threads and flag the rest as unscheduable unless a thread is specifically assigned to them.

If you’re not sharing cores between VMs it’s typical to do the same at the hypervisor layer.

This is the normal use case for any DPDK software. I think anyone involved in HPC or high-speed networking knows that this is pretty common.
Yes, in practice you have to dedicate the whole machine for a specific application, but the one thread per isolated core is a proven one for high performance/low latency applications.
Absolutely! spinlocks is an inter-processor mechanism and mutex inter-processes. Principal difference.
What is a use case for optimizing interprocessor latency?
Hello,

I wrote some software at an automated trading firm that would do the subset of "parsing ticks and maintaining order books" sufficient to map ticks to symbols, and no more work than that. My code was pinned to one core and spun waiting for an expensive network card to DMA some ticks into memory, then when ticks we cared about were found it would load them onto one or more queues. Each queue corresponded to a different thread spin waiting on its own pinned core.

So one use case is "transforming ticks into trading decisions slightly faster than you otherwise could"

Comment-OP here - this is also more or less my use case.

Whether there is a use case for spinlocks outside low-latency trading, i have no idea!

Software industrial process control, such as motors and let robotics. The longer latencies make the control loop less responsive or unstable. Microseconds granularity is very useful there.
a friend of mine recently told me how Windows schedules cpu-bound threads to different cores to prevent thermal throttling, so I now wonder if we were mucking things up by running our CPUs too hot
If you weren't pegging thermal max you shouldn't be seeing thermal limiting. Although you're probably right that cooling is something software people are more likely overlook than SRE/hardware/lab folks.
Liquid cooling is not unheard of in the low latency trading world.
High performance networking in general (NFV) frequently relies on cores dedicated to dpdk for the this.
Essentialy any true HPC workload. You can go pretty far by ignoring latency and doing some series of map and reduce stages on top of giant conceptual file with large-ish records. There are workloads where data for these partial parallel jobs are on the order of machine word or two and then you really care about communication latency between cores.
Check out DPDK/VPP and the LMAX Disruptor paper.
And do not forget Chronicle Queue.
Finishing a high priority job sooner.
Like TLB shootdown IPIs.