by BeeOnRope 528 days ago
We are interested in the software-visible performance effects of pipelining. For small benchmarks that don't miss in the predictors or icache, this mostly means execution pipelining. That's the type of pipelining the article discusses, and the type covered by the instruction performance breakdowns from Agner and uops.info, and simulated by LLVM-MCA, etc.

I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
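To make that concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not how LLVM-MCA or any real tool works: the function name `loop_cycles`, the one-op-per-cycle issue rate, and the `frontend_depth` parameter are all assumptions made for illustration.

```python
def loop_cycles(n_ops, exec_latency, dependent, frontend_depth=0):
    """Estimate cycles for n_ops identical ops on one fully pipelined unit.

    Toy assumptions: one op can issue per cycle, no cache or predictor
    misses, and the front end costs frontend_depth cycles once (pipeline fill).
    """
    if dependent:
        # Each op waits for the previous result: latency-bound.
        exec_cycles = n_ops * exec_latency
    else:
        # Ops overlap in the pipelined unit: throughput-bound.
        exec_cycles = n_ops + exec_latency - 1
    return frontend_depth + exec_cycles


print(loop_cycles(1000, 3, dependent=True))                      # 3000
print(loop_cycles(1000, 3, dependent=False))                     # 1002
print(loop_cycles(1000, 3, dependent=True, frontend_depth=15))   # 3015
```

In this model the end-to-end pipeline depth only shows up as a one-time constant, while the execution latency multiplies through every iteration of a dependent chain.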


Adding to this: the distinction is that an entire "instruction pipeline" can be [and often is] decomposed into many different pipelined circuits. This article is specifically describing the fact that some execution units are pipelined.

Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."

> one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates."

Which are exactly the same thing? For exactly the same reasons?

Sure, you can focus your investigation on one or the other, but that doesn't change what they are or the motivations for doing it.

And you can have a clock period shorter than the delay through a non-pipelined multiplier just fine. The multiply simply takes several cycles, and other uses of that multiplier stall in the meantime.
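A small cycle-by-cycle sketch of that trade-off, in Python. This is a hypothetical toy model: `run_multiplies` and its parameters are made up for illustration, and it assumes independent multiplies that are all ready to issue from cycle 0.

```python
def run_multiplies(n_ops, latency, pipelined):
    """Return the cycle at which the last of n_ops independent multiplies finishes.

    The multiplier takes `latency` cycles per op. A pipelined unit accepts a
    new op every cycle; a non-pipelined one is busy until the current op is done.
    """
    cycle = 0
    issued = 0
    busy_until = 0   # first cycle at which the unit can accept a new op
    last_done = 0
    while issued < n_ops:
        if cycle >= busy_until:
            issued += 1
            last_done = cycle + latency
            # Pipelined: free again next cycle. Non-pipelined: blocked for the full latency.
            busy_until = cycle + (1 if pipelined else latency)
        # Otherwise the op stalls this cycle, waiting for the multiplier.
        cycle += 1
    return last_done


print(run_multiplies(4, 3, pipelined=True))    # 6
print(run_multiplies(4, 3, pipelined=False))   # 12
```

Both variants run at the same short clock period here. The non-pipelined unit just costs throughput: back-to-back multiplies take `n_ops * latency` cycles instead of `n_ops + latency - 1`.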