by loeg 297 days ago
Agner's instruction manual says "A LOCK prefix typically costs more than a hundred clock cycles," which might be dated but is directionally correct. (The atomic version is LOCK ADD.)

If you go to the CPU-specific tables, LOCK ADD is roughly 8-55 cycles of latency (Zen 3: 8, Zen 2: 20, Bulldozer: 55, lol) vs. the expected 1 cycle for regular ADD. And about 10 cycles on Intel CPUs.

So it can be starkly slower on some older AMD platforms, and merely ~10x slower on modern x86 platforms.
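
For anyone who wants to see the two instructions side by side, here is a minimal C++ sketch. The function names are made up for illustration, and the instruction selection noted in the comments is what GCC and Clang typically emit on x86-64 with optimizations on, not something the language guarantees.

```cpp
#include <atomic>
#include <cstdint>

// Plain add: typically compiles to a regular ADD on x86-64.
// Not safe if another thread can touch `counter` concurrently.
void plain_add(std::uint64_t& counter, std::uint64_t n) {
    counter += n;
}

// Atomic add: with the result unused, compilers typically emit LOCK ADD
// on x86-64 (LOCK XADD if the old value is needed). Relaxed ordering does
// not remove the LOCK prefix on x86.
void atomic_add(std::atomic<std::uint64_t>& counter, std::uint64_t n) {
    counter.fetch_add(n, std::memory_order_relaxed);
}
```

Looking at the disassembly of both functions is probably the quickest way to check what your own compiler does.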


On modern CPUs, atomic adds are reasonably fast, but only when they are uncontended. If the cache line holding the value has to bounce between CPUs, that usually costs an extra ~100 ns (not cycles) or so.

Writing performant parallel code always means absolutely minimizing communication between threads.
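
A rough C++ sketch of what minimizing communication can look like for a simple counter. Everything here is an assumption for illustration: the function names, the `Slot` type, and the 64-byte cache line size (common on x86 but not guaranteed). The first version has every thread hammering one cache line with atomic adds; the second lets each thread count privately and publish once into its own padded slot.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Contended: every thread does an atomic add on the same cache line.
std::uint64_t count_shared(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&total, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                total.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return total.load();
}

// One slot per thread, aligned so that slots do not share a cache line.
struct alignas(kCacheLine) Slot {
    std::uint64_t value = 0;
};

// Sharded: each thread counts in a local variable, writes its slot once,
// and the results are summed after all threads have joined.
std::uint64_t count_sharded(unsigned threads, std::uint64_t iters) {
    std::vector<Slot> slots(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&slots, t, iters] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters; ++i)
                ++local;
            slots[t].value = local;  // single write, no atomics needed
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

Both return the same total; the difference is how often threads touch shared memory. An optimizing compiler may collapse the local loop entirely, so treat this as a shape to copy rather than a benchmark.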

Sure, but even the uncontended case is ~10x slower than regular ADD.