They may need to access main memory, but the RMW operation itself doesn't happen over the memory bus. The processor takes ownership of the cache line just as it would for any other memory access, and then operates atomically on that cache line.
The cache coherency protocol takes care of that. In other words, the first part is just a memory load and can take anywhere from 0 to a few hundred clock cycles, while the second part is local to the processor and has a more or less fixed cost. The worst-case execution time is completely dominated by the first part; the best case is dominated by the second.
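To make the two cost components easier to see, here is a minimal C++ sketch (C++17 assumed). The names `count_shared`, `count_private` and `PaddedCounter` are made up for illustration, and the 64-byte cache line size is an assumption that holds on most current x86 parts but not everywhere. In the first function every thread does its `fetch_add` on the same cache line, so most operations first have to pull the line away from another core. In the second, each thread owns its own line, so after the first access only the local, roughly fixed cost remains.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per cache line. 64 bytes is an assumed line size.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Contended case: all threads do their RMW on the same cache line,
// so the line keeps migrating between cores.
std::uint64_t count_shared(unsigned threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// Uncontended case: each thread does its RMW on a line it already owns,
// so only the processor-local cost of the operation is paid.
std::uint64_t count_private(unsigned threads, std::uint64_t iters) {
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();

    // Sum the per-thread counters once all workers have finished.
    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.value.load();
    return total;
}
```

Both functions return the same total (`threads * iters`). If you time, say, `count_shared(4, 10000000)` against `count_private(4, 10000000)`, the shared version will typically be noticeably slower on a multi-core machine, although the exact ratio depends on the hardware.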