| I think it's a bit strong to say that branch prediction "solves" the problem of x86's complex variable-length encoding. But it certainly goes a long way toward mitigating the handicap, allowing x86 uarches to be very competitive with more RISC-like designs.
I suspect that original handicap is a large part of the reason why x86 managed to become a dominant ISA, beating out all of its RISC-derived competitors in the high-performance space. The requirement to keep supporting the legacy complex instruction encoding forced x86 uarch designers down the path of longer (but not too long) pipelines, powerful branch predictors and extremely large out-of-order windows.
It wasn't the obvious approach back in the 90s/2000s; high-performance RISC designs of that era tended to stick with in-order superscalar pipelines. And when they did explore out-of-order designs, they were much more restrained, with way smaller reorder buffers. But in hindsight, it seems to have been the correct approach for high-performance microarchitectures.
I can't help but notice that modern high-performance aarch64 cores from Apple and Arm have pipelines that look almost identical to the designs from AMD and Intel. The main difference is that they can get away with 8-wide instruction decoders instead of a uOP cache.
> which, by the way can not be as compact as compiler-generated RISC instructions - you typically need 64 bits per internal instruction or similar
Nah. According to Agner Fog's testing, Intel only allocates 32 bits per uOP. Immediates/addresses that don't fit in a signed 16 bits are handled with various complex mechanisms. If a 32-bit/64-bit value is still inside the -2¹⁵ to +2¹⁵ range, it can be squashed down. Space can also be borrowed from other uOPs in the same cache line that don't need to store an immediate. Otherwise, the uOP takes up multiple slots in the uOP cache.
I suspect AMD also uses a similar mechanism because, as you point out, caching un-encoded uOPs would be a huge waste of space.
And there is no reason why you need to use the exact same encoding in both the pipeline and the uOP cache; it just needs to be significantly easier to decode than a full x86 instruction. |
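To make the "squash down" condition concrete, here is a minimal Python sketch of the range check only. It is an illustration, not Intel's actual logic: the function name `fits_signed` and the default 16-bit width are assumptions chosen to match the description above, and the borrowing and multi-slot cases are not modelled.

```python
def fits_signed(value: int, bits: int = 16) -> bool:
    """Return True if `value` is representable as a two's-complement
    integer of `bits` bits (hypothetical helper, for illustration)."""
    lo = -(1 << (bits - 1))       # -2**15 for 16 bits
    hi = (1 << (bits - 1)) - 1    # 2**15 - 1 for 16 bits
    return lo <= value <= hi


# A nominally 32-bit immediate with a small value could be stored compactly
# under this assumed rule; a large one could not.
small_ok = fits_signed(-1234)
large_ok = fits_signed(0x12345678)
```

Note that a two's-complement 16-bit field tops out at 2¹⁵ - 1, so the sketch treats +2¹⁵ itself as out of range.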
My assumptions are largely based on annotated die shots, like this one of Rocket Lake (IIRC): https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...
If they are correct, the uOP cache consumes at least as much silicon as the L1I cache, while it can generally hold fewer instructions.
Some napkin math: x86 instructions are 4 bytes long on average, so a 32 KiB L1I can hold 32 KiB / 4 B = 8K instructions, while the uOP cache can hold 4K uOPs (how many uOPs does an x86 instruction translate to on average?). That would indicate that uOPs require twice the silicon area to store compared to "raw" x86 instructions, or that the uOP cache is more advanced/complex than the L1I cache (which may very well be the case).
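The napkin math can be written out as a small Python sketch. The inputs are the assumptions from the text (4-byte average instruction, 32 KiB L1I, 4K-entry uOP cache, roughly equal die area for the two structures); the function names are made up for illustration, and the result is only as good as those guesses.

```python
KIB = 1024


def l1i_capacity(cache_bytes: int, avg_instr_bytes: int) -> int:
    """Number of average-sized instructions that fit in the L1I."""
    return cache_bytes // avg_instr_bytes


def area_per_entry_ratio(l1i_entries: int, uop_entries: int,
                         area_ratio: float = 1.0) -> float:
    """Area per uOP entry relative to area per L1I instruction.

    area_ratio is (uOP cache area) / (L1I area); 1.0 assumes the two
    structures take up the same silicon, as read off the die shot.
    """
    return area_ratio * l1i_entries / uop_entries


l1i_entries = l1i_capacity(32 * KIB, 4)                # 8192 instructions
ratio = area_per_entry_ratio(l1i_entries, 4 * 1024)    # 2.0
print(f"L1I holds ~{l1i_entries} instructions; "
      f"a uOP entry costs ~{ratio:.1f}x the area of an x86 instruction")
```

If an x86 instruction decodes to more than one uOP on average, the uOP cache covers even fewer instructions than its entry count suggests, which would push the per-instruction ratio higher.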
Also visible from the die shots: decoding and branch prediction are far from free.