The correct term to use here is blocking. The hardware bus locks finish in a bounded amount of time and allow the other processors to eventually make progress.
That is a good distinction, but I'm not sure if it's always true. A peripheral device on a shared bus could still hold the lock for an undetermined time.
Yes. Even better than that, atomic instructions are usually completely local to a core. I think the only interaction with the coherency protocol is that a core is guaranteed to be able to hold a cache line in exclusive mode long enough to execute an RMW (and even that is not strictly required, but it is useful for guaranteeing forward progress).
Since NVLink2 and POWER9, even a GPU can issue atomics over the bus, which will be executed locally on the CPU that owns the cache line.
This is very useful in high-contention write-heavy workloads, like atomic counters or accumulators.
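For anyone who wants to poke at the contended-counter case on an ordinary CPU, here is a minimal sketch in Rust. The helper name `contended_count` and its parameters are made up for illustration, and it says nothing about how NVLink2 or POWER9 route the operation; it only shows many threads issuing atomic RMWs against one location without taking a lock.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: `num_threads` threads each add 1 to a shared
/// counter `iters` times, then the final value is returned.
fn contended_count(num_threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));

    let workers: Vec<_> = (0..num_threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // One atomic read-modify-write per iteration; no mutex involved.
                    // Relaxed ordering is enough here because only the final sum matters.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Joining the threads makes all of their increments visible to this thread.
    for w in workers {
        w.join().expect("worker thread panicked");
    }
    counter.load(Ordering::Relaxed)
}
```

Calling `contended_count(4, 100_000)` should return exactly the number of increments issued, since no update can be lost. How fast it runs under contention depends on the hardware.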
Yes, and the cache hierarchy ultimately depends on the memory bus. I suppose this bus, which may be shared with many other devices, doesn't always have a bounded-time guarantee.
Even to main memory there is not necessarily a single memory bus. Intracore or even intrasocket synchronization need not (and usually doesn't) go through main memory anyway.
True, but some atomic instructions may need to access main memory to complete their operation. Whether shortcuts can be taken in most cases is not relevant for worst-case considerations.
They may need to access main memory, but the RMW operation doesn't happen over the memory bus. The processor acquires the cache line just like it would for any other memory access, and then operates atomically on the cache line.
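From the software side, the RMW described above looks roughly like the following sketch: a compare-and-swap retry loop in Rust. The helper `atomic_max` is a hypothetical name used only for illustration (the standard library already provides `fetch_max`), and the mapping to cache-line ownership is the commenters' description, not something the code itself can show.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: atomically raise `cell` to at least `candidate`
/// and return the value that was stored before the call.
fn atomic_max(cell: &AtomicU64, candidate: u64) -> u64 {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        if current >= candidate {
            // Nothing to write: the stored value is already large enough.
            return current;
        }
        // Try to swap in `candidate`. This succeeds only if the cell still holds
        // `current`, i.e. no other thread modified it in between.
        match cell.compare_exchange_weak(
            current,
            candidate,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(previous) => return previous,
            // Another thread won the race (or the weak CAS failed spuriously):
            // retry with the value that was actually observed.
            Err(observed) => current = observed,
        }
    }
}
```

A failed attempt only means some other thread's update landed first, so at least one thread makes progress on every round. That is the forward-progress property the thread is discussing.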