|
|
|
|
|
by monocasa
967 days ago
|
|
I don't think anyone bakes instruction alignment into their caches since the early 2000s, and adding an extra bit to the branch predictors isn't that big of a deal. It's got to be the first or second stage of their front end right before the decoders. |
|
Though, I suspect that's easy problem to fix. The more pressing issue is what happens after the decoders. I understand this is a very wide design, decoding say 10 instructions per cycle.
There might be a single 16bit instruction in the middle of that block 40 bytes, changing the alignment halfway though. To keep the same throughput, Qualcomm now need 20 decoders, one attempting to decode on every 16bit boundary. The extra decoders waste power and die space.
Even worse, they somehow need to collect the first 10 valid instructions from those 20 decoders. I really doubt they have enough slack to do that inside the decode stage, or the next stage, so Qualcomm might find them selves adding an entire extra pipeline stage, (probably before decode, so they can have 20 simpler length decoders feeding into 10 full decoders on the next) just to deal with possible misaligned instructions.
I don't know how flexible their design is, it's quite possible adding an entire extra pipeline stage is a big deal. Much bigger than just rewriting the instruction decoders to 32bit RISC-V.