| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by yubblegum 23 days ago

Let's try this again:

OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.

The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, ..

How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel.

3 comments

mota7 23 days ago

If you have two threads on different cores that write to the same cacheline, the CPU has to enforce write ordering. The way it does this is for one of the cores to acquire a write lock on the cacheline. The actual implementation of this is via some variant of the MESI protocol that results in the cacheline on one core going to an Exclusive state, while copies of the cacheline on every other core becomes Invalid.

MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).

Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.

Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.

So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.

To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.

link

khuey 23 days ago

> The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI.

Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.

link

yubblegum 23 days ago

Thank you.

link

Taniwha 23 days ago

The cache coherency protocols that sit between the CPUs and DRAM always essentially "use a mutex": when a cpu wants to write it broadcasts to all the other CPUs and either gets the latest copy from whoever wrote it last and shoots down any read-only copies in other CPUs or reads it from DRAM (or converts a read only copy to be writeable)

This happens on every memory access, so the thing you want to avoid is ping ponging writeable cache lines between CPUs (especially before you have a chance to actually write it) - LL/SC instructions sit on top of these protocols and allow instructions to tell is a cache line had been "stolen" before you have a chance to write it

link

za_creature 23 days ago

I'm not a systems guy, but you're probably right. A mutex requires an atomic variable which triggers the same cache coherency mechanism under contention.

What Justine probably meant was an apples to oranges comparison: in userspace, you can sometimes add constraints that allow for a faster algorithm whereas the cpu generally has to assume the worst.

That being said, if you look through this thread you'll find similar issues with OP's phrasing. A fine engineer / hacker, but too much 4chan troll in my book.

link