by nathanielherman 2359 days ago
This experiment is a bit weird. According to https://github.com/matklad/lock-bench, it was run on a machine with 8 logical CPUs, but the test uses 32 threads. It's not that surprising that spin locks do badly when you run 4x as many threads as there are CPUs.
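
To make the 4x figure concrete, here is a minimal sketch of the check. The `oversubscription` helper is a hypothetical name of mine, not something from lock-bench, and `std::thread::available_parallelism` only estimates the usable parallelism, so it may differ from the logical CPU count on some systems.

```rust
use std::thread;

/// Ratio of worker threads to logical CPUs; above 1.0 means oversubscribed.
/// Hypothetical helper for illustration, not part of lock-bench.
fn oversubscription(n_threads: usize, n_cpus: usize) -> f64 {
    n_threads as f64 / n_cpus as f64
}

fn main() {
    // The setup described above: 32 threads on 8 logical CPUs.
    println!("benchmark setup: {:.1}x", oversubscription(32, 8));

    // The same check against whatever machine this runs on.
    // Falls back to 1 if the parallelism cannot be determined.
    let n_cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!(
        "this machine: {} CPUs, 32 threads = {:.1}x",
        n_cpus,
        oversubscription(32, n_cpus)
    );
}
```

Anything well above 1.0x means a thread holding the lock can be descheduled while the others burn their time slices spinning.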

I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.

Heavy contention run:

  $ cargo run --release 4 64 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 64 10000 100`
  Options {
      n_threads: 4,
      n_locks: 64,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 2.822382ms   min 1.459601ms   max 3.342966ms  
  parking_lot::Mutex   avg 1.070323ms   min 760.52µs     max 1.212874ms  
  spin::Mutex          avg 879.457µs    min 681.836µs    max 990.38µs    
  AmdSpinlock          avg 915.096µs    min 445.494µs    max 1.003548ms  

  std::sync::Mutex     avg 2.832905ms   min 2.227285ms   max 3.46791ms   
  parking_lot::Mutex   avg 1.059368ms   min 507.346µs    max 1.263203ms  
  spin::Mutex          avg 873.197µs    min 432.016µs    max 1.062487ms  
  AmdSpinlock          avg 916.393µs    min 568.889µs    max 1.024317ms  

Extreme contention run:

  $ cargo run --release 4 2 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 2 10000 100`
  Options {
      n_threads: 4,
      n_locks: 2,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 4.552701ms   min 2.699316ms   max 5.42634ms   
  parking_lot::Mutex   avg 2.802124ms   min 1.398002ms   max 4.798426ms  
  spin::Mutex          avg 3.596568ms   min 1.66903ms    max 4.290803ms  
  AmdSpinlock          avg 3.470115ms   min 1.707714ms   max 4.118536ms  

  std::sync::Mutex     avg 4.486896ms   min 2.536907ms   max 5.821404ms  
  parking_lot::Mutex   avg 2.712171ms   min 1.508037ms   max 5.44592ms   
  spin::Mutex          avg 3.563192ms   min 1.700003ms   max 4.264851ms  
  AmdSpinlock          avg 3.643592ms   min 2.208522ms   max 4.856297ms
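
For reference, here is one way to get both 22% figures from the averages above. It is a sketch under an assumption: I am treating speed as the inverse of average time and comparing spin::Mutex against parking_lot::Mutex using the first result block of each run. `speed_change_pct` is a hypothetical helper name.

```rust
/// Percent change in speed of `candidate` relative to `baseline`,
/// given average times in the same unit. Positive means faster.
/// Assumes speed is proportional to 1 / average time.
fn speed_change_pct(baseline_avg: f64, candidate_avg: f64) -> f64 {
    (baseline_avg / candidate_avg - 1.0) * 100.0
}

fn main() {
    // Averages in microseconds: parking_lot::Mutex first, spin::Mutex second.
    let heavy = speed_change_pct(1070.323, 879.457);
    let extreme = speed_change_pct(2802.124, 3596.568);
    println!("heavy contention:   {:+.1}%", heavy);
    println!("extreme contention: {:+.1}%", extreme);
}
```

With those inputs this comes out to roughly +21.7% for heavy contention and -22.1% for extreme contention. Comparing raw average times instead would give different percentages.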
2 comments

The top comment opens up the concept of latency versus throughput. My interpretation is that this experiment is demonstrating that optimizing only for latency has consequences elsewhere in the system. Which is not surprising at all, but then again I spend a lot of time explaining unsurprising things.

I remember a sort of sea change in my thinking on technical books during a period where I tended to keep them at work instead of at home. I noticed a curious pattern in which ones were getting borrowed and by whom. Reading material isn't only useful if it has something new to me in it. It's also useful if it presents information I already know and agree with, in a convenient format. Possibly more useful, in fact.

If you only have 4 threads, it is likely that all your CPUs are sharing caches and you won't see the real downside of the spinlock. Spinlocks don't really fall apart until you have several sockets.
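
To show where the cache traffic comes from, here is a minimal test-and-test-and-set spinlock. It is an illustrative sketch only, not the implementation of spin::Mutex or AmdSpinlock from the benchmark; the `SpinLock` type and `count_with_spinlock` function are names I made up.

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal test-and-test-and-set spinlock (illustrative, hypothetical type).
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: access to `value` is serialized by the `locked` flag.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(value: T) -> Self {
        SpinLock { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    /// Runs `f` with exclusive access. Sketch only: a panic in `f` leaves
    /// the lock held forever.
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        loop {
            // The atomic write needs the cache line in exclusive state,
            // so every attempt pulls the line away from other cores.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // While the lock is held, wait on plain loads, which can be
            // served from a shared copy of the line.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}

/// Increments a shared counter from `n_threads` threads, `n_ops` times each.
fn count_with_spinlock(n_threads: usize, n_ops: usize) -> usize {
    let lock = Arc::new(SpinLock::new(0usize));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..n_ops {
                    lock.with_lock(|v| *v += 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    lock.with_lock(|v| *v)
}

fn main() {
    println!("{}", count_with_spinlock(4, 10_000));
}
```

Every acquire and release moves the lock's cache line between cores. That is cheap when the cores share a cache and gets more expensive when the line has to cross sockets.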

Note that I get a similar speedup with 6 and 8 threads on my Mac (which has 8 logical CPUs).