by e4e78a06 1610 days ago
> They do nearly 2x the work per clock as AMD chips which necessitates more logic per stage

Not all work is created equal. Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with. And your backend ports are also parallel (although maybe scheduling isn't?). The only places where wider always means more logic per stage are the caches and cache-like structures: register file, L1/L2/L3, BTB. For example, Apple managed to work some magic with a 3-cycle 192kB L1. AMD and Intel are at 4 and 5 cycles for much smaller L1s. Part of the reason is probably that Apple doesn't need to hit 5GHz and can afford more logic per stage.
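
To put rough numbers on the "more logic per stage" point, here is a back-of-the-envelope sketch. The cycle counts and the 3.2GHz clock are from this thread; the 5.0GHz clock for the 4- and 5-cycle L1s is an assumption picked for illustration, not a measured figure for any particular AMD or Intel part.

```python
def l1_latency_ns(cycles: int, clock_ghz: float) -> float:
    """Wall-clock latency of a cache hit that takes `cycles` cycles."""
    return cycles / clock_ghz


def stage_time_ns(clock_ghz: float) -> float:
    """Time budget of a single pipeline stage (one clock period)."""
    return 1.0 / clock_ghz


# (cycles, clock in GHz). Cycle counts come from the comment above;
# the 5.0 GHz entries use an assumed clock, chosen only for comparison.
configs = {
    "3-cycle L1 @ 3.2 GHz": (3, 3.2),
    "4-cycle L1 @ 5.0 GHz (assumed clock)": (4, 5.0),
    "5-cycle L1 @ 5.0 GHz (assumed clock)": (5, 5.0),
}

for name, (cycles, ghz) in configs.items():
    total = l1_latency_ns(cycles, ghz)
    per_stage = stage_time_ns(ghz)
    print(f"{name}: {total:.3f} ns total, {per_stage:.3f} ns per stage")
```

Under those assumptions the wall-clock latencies land in the same ballpark, while each stage at 3.2GHz gets roughly half again as much time as a stage at 5GHz, which is the slack the comment is pointing at.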

And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher, since the current 3.2GHz is very far from what we know TSMC N7/5 can do. I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.

2 comments

> Not all work is created equal.

> And in any case, it's very likely you could just shove more voltage through the chip and get it to clock higher,

Yes.

> since the current 3.2GHz is very far from what we know TSMC N7/5 can do.

N7/N5 can "do" 200GHz. 90nm could do 100GHz. The limit for a given device depends mostly on the logic.

> I don't think you'd need a rework unless Apple wanted to target 4.5+GHz.

3.2->4.4? I doubt it with any reasonable voltage that could actually ship in a device. Very hard to predict these things unless you've at least got basic shmoo plots and things like that in front of you.

> Decoders on Arm are definitely parallel (= don't have more serial logic for wider decoder) compared to the variable length decode x86 is stuck with

x86 decode is parallel too

It is parallel in the sense that you can decode 4-6 instructions in parallel, yes. It is not parallel in the sense that variable length decoding requires each of your decoders to talk to the other ones to coordinate on instruction length boundaries, which means there is going to be a lot of serial logic in your decoder circuit.

It doesn't, if your L1$ predecodes at fill time and stores instruction length.

Something, somewhere does have to do a serial length decoding of course. But when you look at the L2 access latency and throughput (which is the minimum L1 fill latency), it's clear you could afford to do that part of the decode over more cycles.
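
A toy software model of the two approaches being argued about, as a sketch only. The encoding is hypothetical (the low two bits of the first byte give a length of 1 to 4 bytes) and has nothing to do with real x86 encoding; all function names are made up for illustration. The point is where the serial length chain sits: on every fetch, or once at cache fill.

```python
def insn_length(first_byte: int) -> int:
    """Hypothetical toy encoding: low two bits of the first byte give length 1..4."""
    return (first_byte & 0x03) + 1


def serial_boundaries(code: bytes) -> list[int]:
    """Walk the bytes in order; each start depends on the previous instruction's length."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += insn_length(code[pc])
    return starts


def predecode_at_fill(code: bytes) -> list[bool]:
    """Run the serial walk once at fill time and keep one 'starts here' bit per byte."""
    marks = [False] * len(code)
    for start in serial_boundaries(code):
        marks[start] = True
    return marks


def decode_group(code: bytes, marks: list[bool], width: int) -> list[bytes]:
    """Fetch-time decode: each slot takes its instruction from the stored marks,
    so no slot has to wait for another slot's length calculation."""
    starts = [i for i, is_start in enumerate(marks) if is_start][:width]
    return [code[s:s + insn_length(code[s])] for s in starts]


# Four instructions of lengths 2, 1, 3 and 4 under the toy encoding.
code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x03, 0x11, 0x22, 0x33])
marks = predecode_at_fill(code)       # slow path, paid once per line fill
group = decode_group(code, marks, 4)  # fast path, paid on every fetch
```

Picking the k-th marked byte still costs some logic in real hardware, so this only shows that the length-dependent chain moves off the fetch path, not that decode becomes free.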

New designs now predecode not just lengths but entire uops into the first-level instruction cache. It's the same concept; they just call the levels L0 and L1 rather than L1 and L2.