| If you have two threads on different cores that write to the same cacheline, the CPU has to enforce write ordering. The way it does this is for one of the cores to acquire a write lock on the cacheline. The actual implementation of this is via some variant of the MESI protocol that results in the cacheline on one core going to an Exclusive state, while copies of the cacheline on every other core becomes Invalid. MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles). Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved. Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc. So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI. To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory. |
Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.