by titzer
2023 days ago
Lots of other comments point out the vertical integration. For raw single-thread performance:
1. ARM64 is a fixed-width instruction set, so their frontend can decode more instructions in parallel.
2. They got one honking monster of an out-of-order execution engine (630 entries), which feeds:
3. 16 execution ports.
I think I understand 1): since they know the instruction width, they can more accurately divide the instructions among more parallel executors (whatever those are; the execution ports?).
2) I believe this allows more "pre-work" to get done before it's actually needed, but then the "pre-work" just chills until
3) these things do the work, and there are an abnormally high number of them?
P.S. Any noob-friendly reading is also appreciated!
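
For intuition on point 1, here is a toy sketch in Python. The byte encodings are made up for illustration (they are not real ARM64 or x86), and the function names are hypothetical. It only shows why finding instruction boundaries is trivial with a fixed width and inherently serial when each instruction's length has to be read first.

```python
# Toy model of instruction-boundary finding. The encodings are invented
# for illustration and do not correspond to any real instruction set.

def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    """With a fixed width, every start offset is known up front (0, 4, 8, ...),
    so several decoders could each grab their own instruction independently."""
    return list(range(0, len(code), width))


def variable_width_starts(code: bytes) -> list[int]:
    """Hypothetical variable-length encoding: the first byte of each
    instruction stores its total length in bytes. Each start depends on the
    previous instruction's length, so this scan proceeds one step at a time."""
    starts = []
    pos = 0
    while pos < len(code):
        length = code[pos]
        if length == 0:
            raise ValueError(f"zero-length instruction at offset {pos}")
        starts.append(pos)
        pos += length
    return starts


fixed_code = bytes(16)  # four 4-byte instructions
variable_code = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])  # lengths 1, 3, 2, 5

print(fixed_width_starts(fixed_code))
print(variable_width_starts(variable_code))
```

Real variable-length front ends have workarounds for the serial dependency, so treat this as intuition only, not as a description of how any actual decoder is built.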