The cache coherency protocol takes care of that. In other words the first part is just a memory load and can vary from 0 to a few hundred clock cycles, the second is local to the processor and has a more or less fixed cost. The worst-case execution time is completely dominated by the first part, the best case instead is dominated by the second.