You're misreading the benchmark: that's 6 ms for 10,000 lock/unlocks per thread, 320,000 lock/unlocks total. In other words, each thread spends about 0.6 microseconds per lock/unlock pair.
That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a hot-rod implementation, can be acquired and released in < 50ns on the garbage hardware I have before me.
On Intel (and probably very similar on AMD) the cost of a completely uncontended, cache-hit, simple spin lock acquisition is ~20 clock cycles while the release is almost free.