

by throwawaylinux 1610 days ago

Overclocking headroom means you have timing slack, and timing slack means you either have faster (~= leakier) circuits than necessary, or pipeline stages which aren't filled with work, which is also an inefficiency.

I expect Apple has an extreme focus on power efficiency, especially idle / leakage power, much more so than Intel, considering the core is basically the same one they use in their phones. They also have a different approach to turbo / DVFS. So I would expect M1 to actually be a lot tighter than Intel and not have so much OC headroom.

Obviously you can buy timing with voltage to some degree, so there would probably be something there. Modern nodes are running into more problems with voltage-induced breakdown though, so the OC limit looks very different from what you can ship in a product. Has anyone measured M1's VDD?

> AMD chips are doing similar clocks to Intel on TSMC N7, so Apple could (but won't) have a chip running way higher than the clocks they are currently shipping with.

Not their existing microarchitecture though. They do nearly 2x the work per clock as AMD chips which necessitates more logic per stage. Getting a microarchitectural edge means making less logic do more work, and it's very possible Apple has some edge there; it just wouldn't be near 2x IMO.

The silicon technology of course plays into it, but when you look at how fast individual transistors and the shortest poly to connect them can switch, speeds over 100GHz have been possible on 90nm. Today's cutting edge is probably over 200GHz (search for ring oscillator results, for example). So it's not a fundamental switching speed limit of the tech that gets you.

I would say Apple could probably redo the physical design and synthesis work, with minimal logic changes, to target a faster and leakier device that's not suitable for phones but might make for a slightly fairer comparison. It wouldn't put it at a 5-6GHz frequency, but it could easily be enough to re-take these benchmarks and still be ahead on efficiency.


e4e78a06 1610 days ago

> They do nearly 2x the work per clock as AMD chips which necessitates more logic per stage

Not all work is created equal. Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with. And your backend ports are also parallel (although maybe scheduling isn't?). The only places where wider always means more logic per stage are the caches and cache-like structures: register file, L1/L2/L3, BTB. For example, Apple managed to work some magic with a 3 cycle 192kB L1. AMD and Intel are at 4 and 5 cycles for much smaller L1s. Part of the reason is probably that Apple doesn't need to hit 5GHz and can afford more logic per stage.
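
To put rough numbers on the cycle counts above, here is a small sketch that converts them to wall-clock time. The 3.2GHz clock and the 3/4/5 cycle latencies come from this thread; the 5.0GHz clock for the 4 and 5 cycle cases is only an illustrative assumption taken from the "doesn't need to hit 5GHz" remark, not a measured figure for any specific AMD or Intel part.

```python
def latency_ns(cycles: int, clock_ghz: float) -> float:
    """Wall-clock latency of an access that takes a fixed number of cycles."""
    return cycles / clock_ghz


def stage_time_ps(clock_ghz: float) -> float:
    """Time available to the logic in one pipeline stage, in picoseconds."""
    return 1000.0 / clock_ghz


# (label, L1 latency in cycles, clock in GHz); the 5.0 GHz entries are assumed
configs = [
    ("3-cycle L1 at 3.2 GHz", 3, 3.2),
    ("4-cycle L1 at 5.0 GHz (assumed clock)", 4, 5.0),
    ("5-cycle L1 at 5.0 GHz (assumed clock)", 5, 5.0),
]

for label, cycles, ghz in configs:
    print(f"{label}: {latency_ns(cycles, ghz):.3f} ns total, "
          f"{stage_time_ps(ghz):.1f} ps per stage")
```

Under those assumptions the 3 cycle access lands between the 4 and 5 cycle ones in absolute time, which fits the idea that the lower cycle count comes from having more time per stage rather than from a faster array.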

And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher, since the current 3.2GHz is very far from what we know TSMC N7/5 can do. I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.


throwawaylinux 1610 days ago

> Not all work is created equal.

> And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher,

Yes.

> since the current 3.2GHz is very far from what we know TSMC N7/5 can do.

N7/N5 can "do" 200GHz. 90nm could do 100GHz. The frequency a device can actually reach depends mostly on the logic, i.e. how much of it sits in each pipeline stage.
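
A back-of-the-envelope sketch of that point, taking the 200GHz figure above at face value. Treating one period of that figure as a single "switching time" unit is a simplifying assumption for illustration; it is not a count of real gate delays, and the 4.5 and 5.0GHz clocks are just the targets mentioned in this thread.

```python
def switching_periods_per_cycle(device_ghz: float, clock_ghz: float) -> float:
    """How many raw device switching periods fit into one clock cycle."""
    return device_ghz / clock_ghz


DEVICE_GHZ = 200.0  # figure quoted in the thread, used as-is

for clock_ghz in (3.2, 4.5, 5.0):
    budget = switching_periods_per_cycle(DEVICE_GHZ, clock_ghz)
    print(f"{clock_ghz} GHz clock: ~{budget:.1f} switching periods per cycle")
```

Whatever the exact unit, the clock period is tens of times longer than the raw switching time, so the shipped frequency is set by how much logic and wire is packed into each stage rather than by the transistor itself.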

> I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.

3.2 -> 4.5? I doubt it with any reasonable voltage that could actually ship in a device. It's very hard to predict these things unless you've at least got basic shmoo plots and things like that in front of you.


moonchild 1610 days ago

> Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with

x86 decode is parallel too


e4e78a06 1609 days ago

It is parallel in the sense that you can decode 4-6 instructions in parallel, yes. It is not parallel in the sense that variable length decoding requires each of your decoders to talk to the other ones to coordinate on instruction length boundaries, which means there is going to be a lot of serial logic in your decoder circuit.


throwawaylinux 1609 days ago

It doesn't, if your L1$ predecodes at fill time and stores instruction length.

Something, somewhere does have to do a serial length decoding of course. But when you look at the L2 access latency and throughput (which is the minimum L1 fill latency), it's clear you could afford to do that part of the decode over more cycles.
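
A toy sketch of the idea, not how any real x86 front end is built. The encoding here is hypothetical (the low two bits of the first byte give the length), and `predecode_on_fill` / `decode_slot` are made-up names. The serial walk happens once when the line is filled; after that each decoder slot only reads the stored marks and never waits on another slot. Instructions straddling a cache line are ignored.

```python
from typing import List


def insn_length(first_byte: int) -> int:
    """Toy encoding (hypothetical, not x86): low 2 bits + 1 = length in bytes."""
    return (first_byte & 0b11) + 1


def predecode_on_fill(line: bytes) -> List[int]:
    """Serial pass done once at fill time: record where each instruction starts."""
    starts = []
    offset = 0
    while offset < len(line):
        starts.append(offset)
        # The next start depends on this instruction's length: the serial part.
        offset += insn_length(line[offset])
    return starts


def decode_slot(line: bytes, starts: List[int], slot: int) -> bytes:
    """One decoder slot: uses only the stored marks, no other slot's result."""
    begin = starts[slot]
    return line[begin:begin + insn_length(line[begin])]


line = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 0x11, 0x22, 0x33])
starts = predecode_on_fill(line)  # stored alongside the cache line

# Slots can be evaluated in any order; none depends on another.
decoded = [decode_slot(line, starts, slot) for slot in reversed(range(len(starts)))]
print(starts, [insn.hex() for insn in decoded])
```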

Newer designs now predecode not just lengths but entire uops into the first-level instruction cache, which is the same concept; they just call the levels L0 and L1 rather than L1 and L2.
