Overclocking headroom means you have timing slack, and timing slack means you have faster (and so leakier) circuits than necessary, or stages that aren't filled with work, which is also an inefficiency. I expect Apple has an extreme focus on power efficiency, especially idle and leakage power, much more than Intel, considering the core is basically the same one they use in their phones. They also have a different approach to turbo / DVFS. So I would expect M1 to actually be a lot tighter than Intel and not have as much OC headroom.

Obviously you can buy timing with voltage to some degree, so there would probably be something there. Modern nodes are running into more problems with voltage-induced breakdown though, so the OC limit looks very different from what you can ship in a product. Has anyone measured M1's VDD?

> AMD chips are doing similar clocks to Intel on TSMC N7, so Apple could (but won't) have a chip running way higher than the clocks they are currently shipping with.

Not with their existing microarchitecture though. They do nearly 2x the work per clock of AMD chips, which necessitates more logic per stage. Getting a microarchitectural edge means making less logic do more work, and it's very possible Apple has some edge there; it just wouldn't be near 2x IMO.

The silicon technology of course plays into it, but when you look at how fast individual transistors and the shortest poly to connect them can switch, speeds over 100GHz have been possible on 90nm. Today's cutting edge is probably over 200GHz (e.g., search for ring oscillator results). So it's not a fundamental switching speed limit of the tech that gets you.

I would say Apple could probably redo the physical design and synthesis work, with minimal logic changes, to target a faster and leakier device that's not suitable for phones but might be a slightly fairer comparison. It wouldn't put it at 5-6GHz, but it could easily be enough to retake these benchmarks and still be ahead on efficiency.
Not all work is created equal. Decoders on Arm are definitely parallel (a wider decoder doesn't need more serial logic), unlike the variable-length decode x86 is stuck with. Your backend ports are also parallel (although maybe scheduling isn't?). The only place where wider always means more logic per stage is the caches: register file, L1/L2/L3, BTB. For example, Apple managed to work some magic with a 3-cycle 192kB L1, while AMD and Intel are at 4 and 5 cycles for much smaller L1s. Part of the reason is probably that Apple doesn't need to hit 5GHz and can afford more logic per stage.
And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher, since the current 3.2GHz is very far from what we know TSMC N7/5 can do. I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.
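
To put the L1 point in absolute terms, here is a rough back-of-the-envelope sketch that converts the cycle counts above into nanoseconds. The cycle counts and Apple's 3.2GHz come from this comment; the 5.0GHz used for both AMD and Intel is only a round-number assumption based on the "5GHz" target mentioned above, not a measured figure, and the function names are made up for illustration.

```python
# Convert L1 latency from cycles to nanoseconds, and show the time
# budget available per pipeline stage at a given clock.
# ASSUMPTION: 5.0 GHz for AMD and Intel is a round illustrative number,
# not a measurement. Cycle counts and 3.2 GHz are from the comment.

def latency_ns(cycles: int, clock_ghz: float) -> float:
    """Absolute latency in ns: cycles divided by frequency in GHz."""
    return cycles / clock_ghz


def stage_budget_ps(clock_ghz: float) -> float:
    """Time available per cycle (one pipeline stage) in picoseconds."""
    return 1000.0 / clock_ghz


# name -> (L1 latency in cycles, clock in GHz)
L1_CONFIGS = {
    "Apple": (3, 3.2),
    "AMD": (4, 5.0),    # assumed clock
    "Intel": (5, 5.0),  # assumed clock
}


def summarize(configs):
    """Return name -> (latency in ns, per-cycle budget in ps)."""
    return {
        name: (latency_ns(cycles, ghz), stage_budget_ps(ghz))
        for name, (cycles, ghz) in configs.items()
    }


if __name__ == "__main__":
    for name, (lat, budget) in summarize(L1_CONFIGS).items():
        print(f"{name:5s} L1: {lat:.3f} ns total, {budget:.1f} ps per cycle")
```

Under these assumptions the 3-cycle L1 gets roughly 312 ps per cycle against 200 ps at 5GHz, and its absolute latency (about 0.94 ns) lands between the two assumed x86 figures (0.8 ns and 1.0 ns). That is consistent with the idea that a lower clock target buys more logic per stage rather than a faster cache in wall-clock terms.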