| It doesn't matter how optimised the length decoding is. Not doing it is still faster. For an 8-wide or 10-wide design, the propagation delays are getting too long to do it all in a single cycle. So you need the extra pipeline stage. The longer pipeline translates to more cycles wasted on branch mispredicts. RISC-V code is only about 6-14% denser than AArch64 [1], and I'm really not sure the extra complexity is worth it. Especially since AArch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited. > Adding complexity to the I$ hasn't even made sense for x86 in two decades Hang on. Limiting the icache to only 32-bit aligned access actually simplifies it. And since the NUVIA core was originally an AArch64 core, why wouldn't they optimise for hardcoded 32-bit alignment and get a slightly smaller icache? [1] https://www.bitsnbites.eu/cisc-vs-risc-code-density/ |
Even x86 only reads 16- or 32-byte aligned fields out of the I$, then shifts them. There's no extra I$ complexity. You still have to do that shift at some point, in case you don't jump to a 32-byte aligned address. Ideally you also don't want to hit peak decode bandwidth only when starting on a 32-byte aligned program counter, so that whole shift register thing is pretty much a requirement. And that's where most of the propagation delays are.
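To make the fetch-then-shift point concrete, here's a rough software model. It's a sketch under assumptions, not how any particular core does it: `LINE_BYTES`, `fetch_window` and `rvc_boundaries` are names I made up, the 32-byte fetch width is just the number from above, and the length check only handles the 16-bit and 32-bit RISC-V encodings. The loop is sequential for readability; real hardware would do the boundary marking in parallel.

```python
LINE_BYTES = 32  # assumed aligned fetch width, for illustration only


def fetch_window(icache, pc, line_bytes=LINE_BYTES):
    """Read the aligned block containing pc, then shift out the bytes before pc."""
    base = pc & ~(line_bytes - 1)            # the I$ only ever sees aligned addresses
    block = icache[base:base + line_bytes]   # one aligned read
    shift = pc - base                        # how far into the block the PC lands
    return block[shift:]                     # the "shift register" step


def rvc_boundaries(window):
    """Byte offsets where instructions start, assuming only 16/32-bit encodings."""
    offsets = []
    i = 0
    while i + 2 <= len(window):
        # RISC-V is little-endian: the low two bits of the first byte
        # are 0b11 for a 32-bit instruction, anything else is 16-bit.
        size = 4 if (window[i] & 0b11) == 0b11 else 2
        if i + size > len(window):
            break                            # instruction straddles the window edge
        offsets.append(i)
        i += size
    return offsets
```

The I$ read is the same for a fixed-width ISA and a variable-length one; only `rvc_boundaries` is the extra work, and it runs on the already-shifted window.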
> RISC-V code is only about 6-14% denser than Aarch64 [1], I'm really not sure the extra complexity is worth it. Especially since Aarch64 still ends up with a lower instruction count, so it will be faster whenever you are decode limited instead of icache limited.
There's heavy use of fusion, and FWIW the M1 also fuses heavily into micro-ops (and I'm sure the AArch64 morph of NUVIA's cores does too).