by phire 1675 days ago
The stack engine only handles the adjustment of the stack pointer, converting the push and pop to regular load/store uops.

But the store-then-load pattern is optimised by the store buffers, which use store-forwarding to pass the result of the in-flight store straight to the load without having to go through L1 cache.

It's not quite free: you still have to complete the store (the CPU can't assume that optimising away a stack push is safe unless the slot is actually overwritten), and there is still a 4 cycle latency, but that probably isn't an issue thanks to out-of-order execution.
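
To make the forwarding idea concrete, here's a toy Python model. It's only a sketch: the `StoreBuffer` class and its method names are made up for illustration, and real hardware also has to deal with access sizes, partial overlaps and memory ordering, none of which is modelled here. It shows a load picking up the value of an in-flight store, and the store still having to drain to cache afterwards.

```python
class StoreBuffer:
    """Toy model: in-flight stores wait in a buffer until they drain to cache."""

    def __init__(self):
        self.pending = []  # (address, value) pairs, oldest first
        self.cache = {}    # stands in for L1 / memory

    def store(self, addr, value):
        # The store is buffered; it has not reached the cache yet.
        self.pending.append((addr, value))

    def load(self, addr):
        # Search youngest-first so the most recent store to addr wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v, "forwarded"
        return self.cache.get(addr, 0), "cache"

    def drain(self):
        # The store still has to complete, even if a load was forwarded.
        for a, v in self.pending:
            self.cache[a] = v
        self.pending.clear()


sb = StoreBuffer()
sb.store(0x7FF0, 42)       # roughly: push
first = sb.load(0x7FF0)    # roughly: pop while the store is still in flight
sb.drain()
second = sb.load(0x7FF0)   # same address once the store has completed
```

Here `first` comes out of the buffer and `second` comes from the cache, with the same value both times.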


It gets more "free" once you have the zero-latency loads introduced in Zen 2: the load can be speculatively replaced with a register move if the store is close and obvious enough.
How can you have a zero latency load?
In much the same way that register movs can have zero latency: the load's output is renamed to the register source of the corresponding store. That takes the load out of the dependency chain, so it effectively has zero latency as long as the correct store was identified.
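
A rough sketch of that renaming trick, again as a toy Python model. The `Renamer` class, the `mem_map` table and keying it on the literal address text are all assumptions made for illustration, not how any real core tracks this. It also leaves out the speculative part: presumably the hardware still has to check that it picked the right store and recover if it didn't.

```python
class Renamer:
    """Toy register renamer with a small store-to-load renaming table."""

    def __init__(self):
        self.rat = {}        # architectural register -> physical register
        self.mem_map = {}    # address text -> physical register holding the stored value
        self.next_phys = 0

    def _alloc(self):
        phys = f"p{self.next_phys}"
        self.next_phys += 1
        return phys

    def write(self, reg):
        # An ordinary instruction producing `reg` gets a fresh physical register.
        self.rat[reg] = self._alloc()
        return self.rat[reg]

    def store(self, addr, src_reg):
        # Remember which physical register supplied the stored value.
        self.mem_map[addr] = self.rat[src_reg]

    def load(self, dst_reg, addr):
        if addr in self.mem_map:
            # Point the destination at the store's source register:
            # consumers no longer wait on the load at all.
            self.rat[dst_reg] = self.mem_map[addr]
            return "renamed"
        # No matching store seen: a real load fills a fresh register.
        self.rat[dst_reg] = self._alloc()
        return "real load"


r = Renamer()
r.write("rax")                        # some instruction produces rax
r.store("[rsp-8]", "rax")             # roughly: push rax
kind = r.load("rbx", "[rsp-8]")       # roughly: pop rbx
other = r.load("rcx", "[rdi]")        # unrelated load, no store to match
```

After this, `rbx` and `rax` map to the same physical register, so anything consuming `rbx` depends directly on whatever produced `rax`.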