It gets a lot more expensive if you’re actually contending the mutex between threads; and if you’re not, why use a mutex? I agree the uncontended case is fast — it’s just not very useful.
There are a lot of scenarios where contention is rare but you cannot rule it out. For correctness you need mutual exclusion, yet your measured real-world performance essentially never depends on the contended case.
Modern fast mutexes are perfect for that, because their uncontended case is so good. This also inculcates the correct choice for the programmer: you should prefer to write code that is less often contended, rather than fighting for better contended performance at the cost of worse uncontended performance. Contention is bad even if your mutual exclusion primitive performs well.
But Mara measured across simulated workloads with varying contention and this fix improves them all to different extents.
Because it's an incredibly efficient, safe option for doing so. Lots of shared state is rarely contended. For example, imagine a 'Config' that gets updated periodically in the background. Readers of that config only check for updates once per second, and you have 7 parallel readers plus 1 writer on an 8-core system.
A Mutex is a trivial way to solve that problem, and it will be extremely efficient.
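A rough sketch of that Config setup in Rust, in case it helps make the point concrete. The `Config` struct, its `version` field, and `run_demo` are all made up for illustration, and the poll interval is a parameter so you can shrink it from the 1 second in the example. Each thread holds the lock only long enough to swap or clone the value, which is why the threads almost never collide.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical config type; the single field is an assumption for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub version: u64,
}

/// Spawns one writer and `readers` reader threads that share a
/// Mutex-guarded Config. Returns the last version each reader observed.
pub fn run_demo(readers: usize, updates: u64, poll: Duration) -> Vec<u64> {
    let shared = Arc::new(Mutex::new(Config { version: 0 }));

    // Background writer: publishes a new Config every `poll` interval.
    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            for v in 1..=updates {
                // The guard is a temporary, so the lock is released
                // as soon as the assignment finishes.
                *shared.lock().unwrap() = Config { version: v };
                thread::sleep(poll);
            }
        })
    };

    // Readers: take a private copy, then only re-check once per `poll`.
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let mut local = shared.lock().unwrap().clone();
                while local.version < updates {
                    thread::sleep(poll);
                    // Lock is held only for the duration of the clone.
                    local = shared.lock().unwrap().clone();
                }
                local.version
            })
        })
        .collect();

    writer.join().unwrap();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling something like `run_demo(7, 3, Duration::from_secs(1))` would mirror the 7-readers-plus-1-writer case. With critical sections this short and polls this infrequent, nearly every `lock()` should go through the uncontended path, though that is an expectation about the workload rather than something this sketch measures.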