| Why not bake instruction alignment into the cache? When you can assume instructions will always be 32bit aligned, you can simplify the icache read port and the data path from the read port to the instruction decoder. It seems like an oversight not to optimise for that. Though I suspect that's an easy problem to fix. The more pressing issue is what happens after the decoders. I understand this is a very wide design, decoding say 10 instructions per cycle. There might be a single 16bit instruction in the middle of that 40-byte block, changing the alignment halfway through. To keep the same throughput, Qualcomm now need 20 decoders, one attempting to decode on every 16bit boundary. The extra decoders waste power and die space. Even worse, they somehow need to collect the first 10 valid instructions from those 20 decoders. I really doubt they have enough slack to do that inside the decode stage, or the next stage, so Qualcomm might find themselves adding an entire extra pipeline stage (probably before decode, so they can have 20 simpler length decoders feeding into 10 full decoders in the next stage) just to deal with possibly misaligned instructions. I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V. |
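To make the "20 length decoders feeding 10 full decoders" point concrete, here is a rough software sketch of the selection problem. It is only an illustration, not a description of Qualcomm's (or anyone's) actual hardware. The function names `is_compressed` and `find_instruction_starts` are made up for this example, and it assumes only the standard RISC-V length encoding for 16bit and 32bit instructions: a 16bit parcel whose two low bits are not `11` is a compressed instruction.

```python
def is_compressed(parcel: int) -> bool:
    """True if a 16-bit parcel would begin a 16-bit (RVC) instruction.

    In the standard RISC-V encoding, 32-bit instructions have their two
    lowest bits set to 0b11; anything else is a 16-bit instruction.
    """
    return (parcel & 0b11) != 0b11


def find_instruction_starts(parcels, max_insns=10):
    """Return the parcel indices where real instructions begin.

    `parcels` is a fetch block as a list of 16-bit values (20 parcels for
    a 40-byte block), assumed to start on an instruction boundary.
    """
    # Step 1 (parallel in hardware): length-decode at EVERY 16-bit
    # boundary, even though about half of the results will be discarded.
    lengths = [1 if is_compressed(p) else 2 for p in parcels]

    # Step 2 (the serial part): each start position depends on the length
    # of every instruction before it. This chain is what costs timing slack.
    starts = []
    pos = 0
    while pos < len(parcels) and len(starts) < max_insns:
        if pos + lengths[pos] > len(parcels):
            break  # a 32-bit instruction straddles the end of the block
        starts.append(pos)
        pos += lengths[pos]
    return starts


# Ten 32-bit NOPs (0x00000013), little-endian parcel order: low half first.
aligned_block = [0x0013, 0x0000] * 10

# Four NOPs, one c.nop (0x0001), five NOPs, then one more c.nop.
mixed_block = [0x0013, 0x0000] * 4 + [0x0001] + [0x0013, 0x0000] * 5 + [0x0001]

print(find_instruction_starts(aligned_block))
print(find_instruction_starts(mixed_block))
```

With the all-32bit block every instruction starts on an even parcel, so the odd-offset decoders are never needed. In the mixed block the single `c.nop` shifts every later instruction onto an odd parcel, which is why a decoder has to sit on every 16bit boundary and why something has to pick the valid starts out of the 20 candidates.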
> I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.
I'm sure it is legitimately simpler for them. I'm not sure we should bend over backwards and bring down the rest of the industry because they don't want to do it. Veyron and Tenstorrent were both showing off high-perf designs with RV-C.