HN Mirror


by titzer 2023 days ago

I think the M1 chip finally proves the inherent design superiority of RISC over CISC. For years, Intel stayed ahead of all other competitors by having the best process, clockspeeds, and the most advanced out-of-order execution. By internally decoding CISC to RISC, Intel could feed a large number of execution ports to extract maximum ILP. They had to spend gobs of silicon for that: complex decoding, made worse by the legacy of x86's encodings, complex branch prediction, and all that OOE takes a lot of real estate. They could do that because they were ahead of everyone else in transistor count.

But in the end all of that went bye bye when Intel lost the process edge and therefore lost the transistor count advantage. Now with the 5nm process others can field gobs of transistors and they don't have the x86 frontend millstone around their necks. So ARM64 unlocked a lot of frontend bandwidth to feed even more execution ports. And with the transistor budget so high, 8 massive cores could be put on die.

Now, people have argued for decades that the instruction density of CISC is a major advantage, because that density would make better use of I-cache and bandwidth. But it looks like decode bandwidth is the thing that matters. That, and RISC usually requires aligned instructions, which means that branch density cannot be too high, and branch prediction data structures are simpler and more effective. (Intel still has weird slowdowns if you have too many branches in a cache line.)

It seems frontend effects are real.

2 comments

rayiner 2023 days ago

ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPS and feature a MOP cache: https://www.anandtech.com/show/15813/arm-cortex-a78-cortex-x.... And we don't know that the M1 doesn't do that. Also, even on Zen 2, the entire decode section is still a fraction of the size of say the vector units: https://forums.anandtech.com/threads/annotated-hi-res-core-d.... And the cores themselves take up a small amount of the die space on a modern CPU: https://cdn.mos.cms.futurecdn.net/m22pkncJXbqSMVisfrWcZ5-102....

I bet doing 8-wide x86 decoding would be tough, but once you've got a micro-op cache, it's doable so long as you have a cache hit. Zen 3 is 8-wide the 95% of the time you hit the micro-op cache.

The real question is how does Apple keep that thing fed? An 8-wide decoder is pointless if most of the time you've got 6 empty pipelines: https://open.hpi.de/courses/parprog2014/items/aybclrPgY4nPyY... (discussing ILP wall). M1 outperforms Zen 3 by 20% on the SPEC GCC benchmark, at a 1/3 lower clock speed. That's 80% more ILP than Zen 3, which is itself a large advance in ILP.
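The 80% figure follows from the two numbers quoted above. A quick sketch of the arithmetic, assuming performance is simply work-per-clock times clock speed (the variable names are illustrative, and the result is strictly a per-clock throughput ratio, which the comment loosely calls ILP):

```python
# Back-of-the-envelope check of the "80% more ILP" figure.
# Both inputs are the numbers quoted in the comment; nothing here is measured.
perf_ratio = 1.20         # M1 score / Zen 3 score on SPEC GCC (quoted as +20%)
clock_ratio = 1 - 1 / 3   # M1 clock / Zen 3 clock (quoted as 1/3 lower)

# performance = (work per clock) * clock, so the per-clock ratio is:
per_clock_ratio = perf_ratio / clock_ratio

print(f"per-clock throughput ratio: {per_clock_ratio:.2f}x")
```

So 1.2 divided by 2/3 comes out at about 1.8, i.e. roughly 80% more work per clock.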


titzer 2023 days ago

> ARM64 can't be that easy to decode, since ARM's recent high-performance cores (A78, X1) decode ARM64 instructions into MOPS and feature a MOP cache

My point was more that fixed-width instructions allow trivial parallel decoding, while x86 requires either predicting the length of instructions or brute-forcing all possible offsets, which is costly.
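A toy sketch of that difference, in software rather than hardware. The variable-length encoding here is hypothetical (it is not x86): the low two bits of an instruction's first byte give its length minus one. All function names are made up for illustration.

```python
from typing import Callable, List, Tuple

def fixed_width_starts(code_len: int, width: int = 4) -> List[int]:
    """Fixed width: every start offset is i * width, known up front, so each
    slot can go to its own decoder without inspecting any other instruction."""
    return list(range(0, code_len, width))

def toy_length(code: bytes, pos: int) -> int:
    """Hypothetical toy encoding (NOT x86): length is 1..4 bytes,
    taken from the low two bits of the first byte."""
    return (code[pos] & 0b11) + 1

def serial_starts(code: bytes, length_of: Callable[[bytes, int], int]) -> List[int]:
    """Variable length, serial scan: the start of instruction n+1 is only
    known once the length of instruction n has been determined."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += length_of(code, pos)
    return starts

def brute_force_starts(code: bytes,
                       length_of: Callable[[bytes, int], int]) -> Tuple[List[int], int]:
    """Variable length, brute force: compute a length at every byte offset
    (independent work, so it could run in parallel), then follow the chain
    from offset 0 to keep only the real starts. Also returns how many
    length computations were spent."""
    lengths = [length_of(code, p) for p in range(len(code))]
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += lengths[pos]
    return starts, len(lengths)

code = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC, 0x03, 0x11, 0x22, 0x33])
print(fixed_width_starts(16))                # [0, 4, 8, 12]
print(serial_starts(code, toy_length))       # [0, 3, 4, 6]
print(brute_force_starts(code, toy_length))  # ([0, 3, 4, 6], 10)
```

In this toy case the brute-force path does 10 length computations to find 4 instructions, which is the kind of wasted work the comment is pointing at; the fixed-width path does none.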

> The real question is how does Apple keep that thing fed?

That's why there's such an enormous reorder buffer. It's so that there's a massive amount of potential work out there for execution ports to pick up and do. Of course, that's all wasted when you have a branch mispredict. I haven't seen anything specific about M1's branch prediction, but it is clearly top-notch.


cma 2023 days ago

Other things helping are forced >=16 KB page sizes and massive L1 caches (M1 has 4X Zen 3's L1 data cache and 3X the L1 instruction cache; how much of that cache size is enabled by the new process node and larger page sizes vs just the lack of x86 decode, I don't know).
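One commonly cited link between page size and L1 size is the virtually-indexed, physically-tagged (VIPT) constraint: if the cache index bits must come from the page offset, capacity is capped at page size times associativity. A rough sketch, assuming 8-way L1 data caches on both chips (the associativity is an assumption here, not something stated in this thread):

```python
KIB = 1024

def max_vipt_cache_bytes(page_bytes: int, ways: int) -> int:
    """Largest cache whose index bits fit entirely inside the page offset:
    each way can hold at most one page's worth of data."""
    return page_bytes * ways

# Assumed 8-way associativity for both; page sizes as discussed above.
print(max_vipt_cache_bytes(4 * KIB, 8) // KIB)    # 32  (4 KB pages)
print(max_vipt_cache_bytes(16 * KIB, 8) // KIB)   # 128 (16 KB pages)
```

Under those assumptions the 4X ratio in L1 data cache size matches the 4X ratio in page size, though that does not by itself say how much the process node or the frontend contributes.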


bcrl 2021 days ago

L1 cache size is driven by the target clock frequency. Apple is not aiming for 5+GHz, whereas both Intel and AMD cores can turbo above 5GHz these days.
