These are synthetic benchmarks but it's quite significant in them.
From a different tweet:
> It's the total time for 32 threads each doing 10'000 lock+unlocks (on a 64C/128T threadripper). So, the numbers you quoted correspond to a lock+unlock operation going from 8.75ns to 2.45ns, under low contention.
> The numbers can vary a lot in different situations/hardware though.
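For anyone who wants to poke at this locally, here is a rough sketch of that benchmark's shape, not the harness behind the quoted numbers. `run_lock_benchmark` is a made-up name, the elapsed time includes thread spawn and join, and every thread hammers one mutex in a tight loop, so this is likely more contended than the "low contention" case in the quote.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical harness: `threads` threads each lock and unlock one shared
/// mutex `iters` times. Returns the final counter and the wall-clock time.
fn run_lock_benchmark(threads: usize, iters: usize) -> (u64, Duration) {
    let counter = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // The guard is a temporary dropped at the end of this
                    // statement, so each iteration is one lock + one unlock.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Note: this also counts thread spawn/join overhead.
    let elapsed = start.elapsed();
    let total = *counter.lock().unwrap();
    (total, elapsed)
}

fn main() {
    let (total, elapsed) = run_lock_benchmark(32, 10_000);
    let per_op_ns = elapsed.as_nanos() as f64 / total as f64;
    println!("{total} lock+unlock pairs in {elapsed:?} (~{per_op_ns:.2} ns each)");
}
```

Dividing the total time by `threads * iters` is how a total turns into a per-operation figure like the ones quoted; build with `--release` or the numbers mean very little.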
I think the focus is on performance differences that come from synchronization and implementation choices (https://twitter.com/m_ou_se/status/1526211117651050497), which are not super easy to characterize but come from much more than just removing an allocation.
> you're often going to be better off eliminating the Arc/Mutex anyway
Not always. Mutexes can be really fast (10-20ns), especially since they often optimistically spin, and Arc in Rust is (often) relatively low cost since you can hand out "free" refs without touching the atomic.
If removing the Arc/Mutex would require allocations the Arc/Mutex could easily be faster.
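A small sketch of what "free" refs means here, as I understand it: borrowing through an `Arc` is an ordinary pointer dereference, and only `Arc::clone` (and the matching drop) touches the atomic count. `total_len` is just an illustrative helper.

```rust
use std::sync::Arc;

/// Illustrative helper: takes a plain borrow, knows nothing about Arc.
fn total_len(words: &[String]) -> usize {
    words.iter().map(|w| w.len()).sum()
}

fn main() {
    let shared: Arc<Vec<String>> =
        Arc::new(vec!["lock".to_string(), "unlock".to_string()]);
    assert_eq!(Arc::strong_count(&shared), 1);

    // Deref coercion turns &Arc<Vec<String>> into &[String]; the reference
    // count is not incremented for this call.
    let n = total_len(&shared);
    assert_eq!(n, 10);
    assert_eq!(Arc::strong_count(&shared), 1);

    // Only an explicit clone performs the atomic increment.
    let second = Arc::clone(&shared);
    assert_eq!(Arc::strong_count(&shared), 2);
    drop(second);
    assert_eq!(Arc::strong_count(&shared), 1);
}
```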
Notably, still worse than 0 ns. Ditto for Arc's refcounting and additional allocation. I'm not saying go on a crusade against Arc+Mutex here, but the easiest way to make effective use of modern multicore CPUs is to go to shared-nothing, independent data-per-thread designs (obviating Arc+Mutex). And if you aren't using Arc+Mutex, it's harder to accidentally share mutable state between threads.
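To make "shared-nothing" concrete, a minimal sketch, assuming the work splits cleanly into independent chunks (which real workloads often don't). Each thread only reads its own slice and returns a partial result, so no `Arc` or `Mutex` appears. `parallel_sum` is a hypothetical name, and `std::thread::scope` needs Rust 1.63 or later.

```rust
use std::thread;

/// Hypothetical example: sums `data` using up to `workers` threads, each
/// working on its own disjoint chunk. Results are merged after joining.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<u64>()))
            .collect();
        // The only cross-thread communication is each thread's return value.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1_000).collect();
    println!("{}", parallel_sum(&data, 8));
}
```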
I just think people seriously overestimate the cost of a mutex when implemented efficiently. Unlocking a mutex can be ~10-20x faster than fetching a value from main memory, or just a bit slower than a few integer operations. The way people talk about mutex operations you'd think that it's akin to hitting disk when it's actually a few orders of magnitude closer to hitting your L2 cache.
It gets a lot more expensive if you’re actually contending the mutex between threads; and if you’re not, why use a mutex? I agree the uncontended case is fast — it’s just not very useful.
There are a lot of scenarios where you're rarely contended but you cannot rule it out, so for correctness reasons you should use mutual exclusion but your measured performance in the real world essentially never cares about the contended case.
Modern fast mutexes are perfect for that, because their uncontended case is so good. This also inculcates the correct choice for the programmer: you should prefer to write code that is less often contended, not fight hard to get better contended performance at the cost of worse uncontended performance. Contention is bad even if your mutual exclusion primitive performs well.
But Mara measured across simulated workloads with varying contention and this fix improves them all to different extents.
Because it's an incredibly efficient, safe option for doing so. Lots of shared state is rarely contended. For example, imagine you have a 'Config' that gets updated periodically in the background, readers of that config only check for updates every 1 second, and you have 7 parallel readers (and 1 writer for an 8 core system).
A Mutex is a trivial way to solve that problem that will be extremely efficient.
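Roughly what that looks like, as a sketch rather than a recommendation. The `Config` fields, `snapshot`, and `run` are invented for illustration, and the 1-second polling interval is left out so the example finishes immediately; each reader here takes a single snapshot instead of looping.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical config type; the fields are placeholders.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    version: u64,
    endpoint: String,
}

/// Copies the current config out, holding the lock only for the clone.
fn snapshot(shared: &Mutex<Config>) -> Config {
    shared.lock().unwrap().clone()
}

/// One writer applies `updates` updates while `readers` threads each take
/// one snapshot. Returns the final config and the versions readers saw.
fn run(readers: usize, updates: u64) -> (Config, Vec<u64>) {
    let shared = Arc::new(Mutex::new(Config {
        version: 0,
        endpoint: "v0".to_string(),
    }));

    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            for v in 1..=updates {
                // Both fields change under one lock, so readers never
                // observe a half-updated config.
                let mut cfg = shared.lock().unwrap();
                cfg.version = v;
                cfg.endpoint = format!("v{v}");
            }
        })
    };

    let reader_handles: Vec<_> = (0..readers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || snapshot(&shared).version)
        })
        .collect();

    writer.join().unwrap();
    let seen = reader_handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();
    (snapshot(&shared), seen)
}

fn main() {
    let (final_cfg, seen) = run(7, 100);
    println!("final: {final_cfg:?}, reader snapshots: {seen:?}");
}
```

With readers waking up once a second, the lock is almost never held when someone else wants it, which is the rarely-contended case being described.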
Don't atomic operations trigger cache synchronisation in CPUs? Doesn't that affect performance negatively? That would mean even a non-contended mutex would affect performance negatively. I suspect it depends a lot on the specific workload (and maybe even what addresses data is stored at in memory), so I'd measure the specific case, but that's my a priori gut feeling.
The overhead of atomics is almost (if not entirely?) exclusively with regards to managing the caches in the CPU. Otherwise they're just normal bytes. Your CPU already has to do some cache management with regular bytes, so an atomic is only worse if there's contention (because that forces a flush).
The worst case for an atomic write is two additional cache line flushes, iirc.