|
|
|
|
|
by yubblegum
23 days ago
|
|
Let's try this again: OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace. The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, .. How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel. |
|
MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).
Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.
Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.
So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.
To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.