Hacker News
by ajross 535 days ago
> I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelining): only pipelining within the execution of an instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).

I'm curious what you think the distinction is? Those statements are equivalent. The circuit implementing "an instruction" can't work in a single cycle, so you break it up and overlap sequentially issued instructions. Exactly what the stages do will differ between hardware designs, sure, and clearly we've moved beyond the classic four-stage Patterson pipeline. But that doesn't make it a different kind of pipelining!


We are interested in the software-visible performance effects of pipelining. For small benchmarks that don't miss in the predictors or icache, this mostly means execution pipelining. That's the type of pipelining the article is discussing, and the type covered by the instruction performance breakdowns from Agner and uops.info and simulated by LLVM-MCA, etc.

I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
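To make that concrete, here is a rough sketch of the kind of back-of-the-envelope model this implies. It is not how LLVM-MCA or any real tool works. The function names and the "one issue per cycle" default are my own assumptions for illustration. Note that the front-end pipeline depth doesn't appear anywhere in it.

```python
# Toy steady-state model for a tight loop on a single pipelined execution unit.
# Assumptions (hypothetical, for illustration only): every op has the same
# execution latency, and a pipelined unit can start one new op per cycle.

def cycles_dependent_chain(n_ops: int, latency: int) -> int:
    """Each op needs the previous op's result, so latencies add up."""
    return n_ops * latency

def cycles_independent(n_ops: int, latency: int, issue_interval: int = 1) -> int:
    """Independent ops overlap: one starts every issue_interval cycles,
    and the last one still needs `latency` cycles to produce its result."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency

if __name__ == "__main__":
    # 100 ops with a 3-cycle execution latency
    print(cycles_dependent_chain(100, 3))  # 300
    print(cycles_independent(100, 3))      # 102
```

The dependent chain is bound by execution latency, the independent stream by issue rate. Neither number changes if the end-to-end pipeline is 10 stages deep or 20.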

Adding to this: the distinction is that an entire "instruction pipeline" can be [and often is] decomposed into many different pipelined circuits. This article is specifically describing the fact that some execution units are pipelined.

Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."

> one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates."

Which are exactly the same thing? For exactly the same reasons?

Sure, you can focus your investigation on one or the other but that doesn't change what they are or somehow change the motivations for why it is being done.

And you can have a clock period shorter than your non-pipelined multiplier's latency just fine. It's just that other uses of that multiplier would stall in the meantime.
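A quick sketch of that stalling, under made-up assumptions: a single multiplier, independent multiplies arriving one per cycle, and a hypothetical `run_multiplies` helper that just tracks when the unit can accept the next op.

```python
# Toy model of one multi-cycle multiplier, pipelined or not.
# Assumptions (hypothetical): op i arrives at cycle i, all ops are
# independent, and an op waits if the unit cannot accept it yet.

def run_multiplies(n_ops: int, latency: int, pipelined: bool) -> tuple[int, int]:
    """Return (cycle at which the last result is ready, total stall cycles)."""
    next_free = 0   # first cycle at which the unit accepts a new op
    stalls = 0
    finish = 0
    for i in range(n_ops):
        arrival = i
        start = max(arrival, next_free)
        stalls += start - arrival
        finish = start + latency
        # A pipelined unit accepts a new op next cycle; a non-pipelined
        # unit stays busy until the current op is done.
        next_free = start + (1 if pipelined else latency)
    return finish, stalls

if __name__ == "__main__":
    print(run_multiplies(4, 3, pipelined=True))   # (6, 0)
    print(run_multiplies(4, 3, pipelined=False))  # (12, 12)
```

Same clock, same 3-cycle latency in both cases. The only difference is whether later multiplies wait for the unit.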

Independently scheduled and queued execution phases are qualitatively different from a fixed pipeline.

An OoO design is qualitatively different from an in-order one because of renaming and dynamic scheduling, but the pipelining is essentially the same and for the same reasons.