by tbrownaw 4759 days ago
It seems to me that if you've got double the clock, the likely explanation is that half the transistors are switching per clock.

Suppose CPU A has an adder that takes one clock cycle to run an add instruction. When two registers are being added, the instruction goes thru the entire adder in one clock cycle and affects on average some % of the transistors.

Suppose CPU B has a pipelined adder that takes two clock cycles to run an add instruction. When two registers are being added, the instruction goes thru half of the adder in one cycle, and the other half in the next cycle, and affects about half of that same % of the transistors each time. BUT! This is a pipelined adder, and doesn't just do one instruction at a time. During the first cycle, when our instruction is in the first part of the adder, some other add instruction is still going thru the second part of the adder and affecting the other half of whatever % of the transistors. And during the second cycle of our instruction, the next instruction is going thru the first half. So even tho any one instruction only affects half of the adder at a time, the entire adder still gets affected every clock cycle.
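
To make that concrete, here's a rough toy model (my own sketch, not anything measured): split the adder into equal slices, issue one add per cycle, and count what fraction of the slices is busy each cycle. The function name and the "equal slices" assumption are made up for illustration.

```python
def active_fraction_per_cycle(stages, num_adds):
    """Return, for each clock cycle, the fraction of adder stages doing work.

    Toy model (assumption): the adder is cut into `stages` equal slices,
    one add is issued per cycle, and an add sits in slice k during its
    k-th cycle in the pipeline.
    """
    total_cycles = num_adds + stages - 1
    fractions = []
    for cycle in range(total_cycles):
        busy = 0
        for stage in range(stages):
            instr = cycle - stage  # index of the add currently in this stage
            if 0 <= instr < num_adds:
                busy += 1
        fractions.append(busy / stages)
    return fractions


cpu_a = active_fraction_per_cycle(stages=1, num_adds=4)  # single-cycle adder
cpu_b = active_fraction_per_cycle(stages=2, num_adds=4)  # two-stage pipelined adder
print(cpu_a)  # [1.0, 1.0, 1.0, 1.0]
print(cpu_b)  # [0.5, 1.0, 1.0, 1.0, 0.5]
```

Apart from the first and last cycle (filling and draining the pipeline), CPU B's adder is fully busy every cycle, same as CPU A's.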

1 comment

In that example, CPU B's adder can also be clocked twice as fast. If so, it's getting twice the work done and using twice the power (ignoring cache misses and the like for the moment). If it's clocked the same as A, its performance and power usage will be almost the same as A's.
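
Back-of-the-envelope version of that, as a sketch in relative units. Assumptions: the pipeline stays full, every add switches the same number of transistors regardless of how the adder is split, and the extra pipeline latches cost nothing (which is why the real answer is "almost" rather than "exactly"). `adder_stats` and its parameters are hypothetical names.

```python
def adder_stats(stages, clock, switches_per_add=1.0):
    """Idealized throughput, power and latency of a fully fed adder pipeline.

    All values are relative units. Assumes one add completes per cycle and
    that power is proportional to transistor switches per unit time.
    """
    adds_per_time = clock                     # one add retires every cycle
    power = adds_per_time * switches_per_add  # switches per unit time
    latency = stages / clock                  # time for one add to finish
    return adds_per_time, power, latency


a = adder_stats(stages=1, clock=1.0)       # CPU A
b_same = adder_stats(stages=2, clock=1.0)  # CPU B at A's clock
b_fast = adder_stats(stages=2, clock=2.0)  # CPU B at double the clock

print(a)       # (1.0, 1.0, 1.0)
print(b_same)  # (1.0, 1.0, 2.0)
print(b_fast)  # (2.0, 2.0, 1.0)
```

At the same clock, B matches A on throughput and power and only loses on latency; at double the clock it does twice the adds for twice the power.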

Roughly speaking, power used = transistors switching per unit time. Performance should also follow that pretty closely, depending on the efficiency of the design. At some level, you should be able to look at any instruction and find a corresponding number of transistors that need to switch for it to execute.
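
For reference, the usual first-order textbook approximation for CMOS dynamic power says the same thing a bit more formally. The symbols are the standard ones, not anything specific to these chips: $\alpha$ is the fraction of the capacitance that switches each cycle, $C$ the total switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. It ignores leakage.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} \, f
```

Here $\alpha C f$ is the "transistors switching per unit time" part. At a fixed voltage, doubling $f$ with the same $\alpha$ doubles the power.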

Deep pipelining keeps more silicon active at any given time, increasing both performance and power consumption. Because of cache misses and the like, efficiency will drop somewhat. Double the stages also doesn't quite equal double the switches per time, for various reasons. Therefore, deeper pipelines = worse performance per watt but better performance per dollar (not sure how well that'll hold in ridiculous cases like Prescott).

From what I heard, Bulldozer only has one more stage than Haswell (15 vs. 14, don't quote me on that) - not nearly enough to account for the differences we see between them.

What I'm noting is that there are many, many more factors at play than just pipelining. In the case of Bulldozer, I've been hearing quite a bit about minor parts that they found needed more work, most notably branch prediction. It sounds like they've got lots of things that will improve performance with no power or die size downsides. The number I saw bandied about for Steamroller was a 30% performance increase. I have some trouble believing it's quite that big, but if they pull it off, that will be an amazing chip for being 32nm. It hints to me that the macroscale architecture is A-OK, and they just screwed up some small but important things.