by gpderetta 2363 days ago
I do not think x86 atomics are implemented as LL/SC internally. At a minimum they always guarantee forward progress: as soon as the cache line is acquired in exclusive mode (and the coherency protocol guarantees that this happens in finite time), the load-op-write always happens and cannot be interrupted.

Also, as far as I'm aware, at least on Intel all atomic operations take pretty much exactly the same number of clock cycles (except for CAS and DCAS, which are ~10% and 50% more expensive, IIRC).


That is exactly my point. Any x86 SMP platform since the Pentium is built on the assumption that truly exclusive access is possible. On shared-FSB platforms that is trivially implemented by a global LOCK#, and K8/QPI simply has to somehow simulate the same behavior on top of a switched NUMA fabric (this is one of the reasons why x86 NUMA coherency protocols are somewhat wasteful, think global broadcasts, and incredibly complex).

For context: before the Pentium, with its glueless 2x2 SMP/redundancy support, there were various approaches to shared-memory x86 multiprocessors with wildly different memory coherence models. (Some of those "let's design a board with eight 80386s" designs are the reason why Intel designed the i586 to be glueless, and such systems are probably still in use to this day, although unsupported.)

No, all that is needed to implement x86 atomic semantics is the guarantee that a single cache line can be held in exclusive mode for a minimum length of time.

As forward progress is a pretty basic requirement, in practice even LL/SC platforms provide it, but instead of having a single instruction with guaranteed forward progress you have to use a few special (but sometimes underspecified) sequences of instructions between the LL/SC pairs.
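
To make the contrast concrete, here is a minimal C11 sketch (not from the thread; the function names `add_with_rmw` and `add_with_cas_loop` are made up for illustration). On x86, compilers typically turn the first function into a single LOCK-prefixed instruction, while the second is the retry-loop shape that LL/SC-style ISAs usually end up with for a weak compare-exchange.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Single read-modify-write. On x86 this is typically one LOCK-prefixed
 * instruction (for example lock xadd), so software sees no retry loop. */
uint32_t add_with_rmw(_Atomic uint32_t *counter, uint32_t delta)
{
    return atomic_fetch_add(counter, delta); /* returns the previous value */
}

/* Retry loop. The weak compare-exchange is allowed to fail spuriously,
 * which is what lets it map onto an LL/SC pair; forward progress then
 * depends on what the platform promises for such sequences. */
uint32_t add_with_cas_loop(_Atomic uint32_t *counter, uint32_t delta)
{
    uint32_t seen = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &seen, seen + delta)) {
        /* On failure, `seen` has been refreshed with the current value,
         * so the next iteration recomputes seen + delta from it. */
    }
    return seen; /* the value the successful exchange replaced */
}
```

Both functions return the value that was in the counter before the addition; the difference is only in who is responsible for retrying, the hardware or the loop.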

> in practice

FWIW RISC-V guarantees forward progress for reasonable uses:

> We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches.

[sorry for the late reply]

What happens if those 16 instructions touch 16 different cache lines? I'm not a hardware expert (and even less of one on coherency protocols), but I think it would be extremely hard to make sure livelock is avoided in all cases, short of having some extremely expensive and heavy-handed global 'bus' lock fallback.

Reading and writing memory are excluded from the guarantee, aside from the LR/SC instructions that bookend a transaction. Inside the transaction you're basically limited to register-register ALU and aborting branches.
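
A toy model may help picture this. The following C sketch is purely illustrative and is not how any real core works: it models one memory word plus a single hart's reservation, and all names (`toy_mem`, `toy_lr`, `toy_sc`, `toy_other_hart_store`, `toy_increment`) are hypothetical. The only work between the load-reserved and the store-conditional is a register add, which is the kind of sequence the guarantee covers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one memory word and one hart's reservation on it. */
typedef struct {
    uint32_t word;
    bool reserved;
} toy_mem;

/* Load-reserved: read the word and register a reservation. */
uint32_t toy_lr(toy_mem *m)
{
    m->reserved = true;
    return m->word;
}

/* Store-conditional: returns 0 on success and 1 on failure,
 * mirroring the RISC-V SC result convention. */
int toy_sc(toy_mem *m, uint32_t value)
{
    if (!m->reserved)
        return 1;
    m->word = value;
    m->reserved = false;
    return 0;
}

/* A store from another hart to the same word kills the reservation. */
void toy_other_hart_store(toy_mem *m, uint32_t value)
{
    m->word = value;
    m->reserved = false;
}

/* Increment via an LR/SC retry loop; returns the number of attempts.
 * `interfere_first` injects one conflicting store during attempt 1. */
int toy_increment(toy_mem *m, bool interfere_first)
{
    int attempts = 0;
    for (;;) {
        attempts++;
        uint32_t v = toy_lr(m);
        if (interfere_first && attempts == 1)
            toy_other_hart_store(m, 100);
        v = v + 1; /* register-only work between LR and SC */
        if (toy_sc(m, v) == 0)
            return attempts;
    }
}
```

Without interference the loop finishes on the first attempt; with one injected conflicting store it fails once and succeeds on the retry, using the freshly loaded value.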