| According to the label, that block contains both the uOP cache AND the microcode ROM (which is actually at least partially RAM to allow for microcode updates). I guess it makes sense to group the two functions together: they are both alternative sources of uOPs that don't come from the instruction decoder. So it really depends on what the balance is. If it were two or three of those memory cell blocks, I agree it's quite big. But if it's just one, it's actually quite small. Agner's findings are for the Sandy Bridge implementation. He says Haswell and Skylake share the same limitations, but it doesn't look like he has done much research into the later implementations. The findings actually point to the uOP cache being much simpler in structure. The instruction cache has to support arbitrary instruction alignment and fetches that cross boundaries. The uOP cache has strict alignment requirements: it delivers one cache line per cycle and always delivers the entire line. If there aren't enough uOPs, then the rest of the cache line is unused. > Also visible from the die shots: decoding and branch prediction are far from free. Yeah, it appears to be massive. And I get the impression that block is more branch prediction than decoding. Nothing is free in CPU design, it's just a massive balancing act. |
Yes, so it's hard to tell the exact size. We can only conclude that the uOP cache and the microcode ROM combined are about twice the size of the L1I cache (in terms of memory cells).
Another core die shot, this one of the Zen 2 microarchitecture, is this (it appears to be correct as it is based on official AMD slides): https://forums.anandtech.com/proxy.php?image=https%3A%2F%2Fa...
Here uCode is in a separate area, and if we assume that the SRAM blocks in the area marked "Decode" represent the uOP cache, then we have:
* The uOP cache has the same physical size as the L1I cache
* uOP cache size = 4K uOPs
* L1I cache size = 32 KiB ~= 8K x86 instructions
If all this holds true (it's a big "if"), the number of uOPs that the uOP cache can hold is only half the number of x86 instructions that the L1I cache can hold, and the size of each uOP entry is in fact close to 32 KiB / 4K uOPs = 8 bytes = 64 bits (given how similar the SRAM cells for the two caches are on the die shot, I assume that they have the same density).
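To make the arithmetic explicit, here is a rough sketch in Python. The 4-byte average x86 instruction length is my own assumption (it is what "32 KiB ~= 8K x86 instructions" implies), and the function names are made up for illustration. It also bakes in the guess that the uOP cache occupies the same SRAM area at the same density as the L1I cache.

```python
KIB = 1024

def uop_entry_bits(cache_bytes: int, uop_slots: int) -> float:
    """Bits per uOP slot if the uOP cache uses the same amount of SRAM as cache_bytes."""
    return cache_bytes * 8 / uop_slots

def l1i_instruction_capacity(cache_bytes: int, avg_insn_bytes: float) -> float:
    """Approximate number of x86 instructions that fit in the L1I cache."""
    return cache_bytes / avg_insn_bytes

l1i_bytes = 32 * KIB      # L1I cache size from the die shot discussion
uop_slots = 4 * 1024      # 4K uOPs
avg_insn_bytes = 4.0      # ASSUMPTION: average x86 instruction length in bytes

bits_per_uop = uop_entry_bits(l1i_bytes, uop_slots)
l1i_insns = l1i_instruction_capacity(l1i_bytes, avg_insn_bytes)
capacity_ratio = uop_slots / l1i_insns

print(f"bits per uOP entry: {bits_per_uop:.0f}")
print(f"L1I capacity: {l1i_insns:.0f} instructions")
print(f"uOP cache / L1I capacity ratio: {capacity_ratio:.2f}")
```

Changing `avg_insn_bytes` shifts the ratio directly, so the "half" figure is only as good as that assumption.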
Furthermore, I assume that one x86 instruction translates to more than one uOP on average: instructions involving memory operands are cracked, and instructions with large immediates occupy more than one uOP slot. Even the ARMv8 Vulcan microarchitecture sees a ~15% increase in instructions when cracking ARM instructions into uOPs ( https://en.wikichip.org/wiki/cavium/microarchitectures/vulca... ). That would mean that the silicon area efficiency of the uOP cache compared to a regular L1I cache is even less than 50%.
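As a back-of-the-envelope sketch of that last step (Python, hypothetical function name): if each x86 instruction expands to more than one uOP on average, the number of x86 instructions the uOP cache effectively covers shrinks accordingly. The 1.15 factor below is just the Vulcan figure borrowed as a placeholder. I don't have a measured expansion factor for x86, so treat it as an assumption.

```python
def uop_cache_area_efficiency(uop_slots: int, l1i_insns: int, uops_per_insn: float) -> float:
    """x86 instructions covered by the uOP cache, relative to an L1I cache of equal area."""
    covered_insns = uop_slots / uops_per_insn
    return covered_insns / l1i_insns

uop_slots = 4 * 1024    # 4K uOPs
l1i_insns = 8 * 1024    # ~8K x86 instructions in 32 KiB (assumes ~4 bytes each)

# ASSUMPTION: 1.15 uOPs per instruction, borrowed from the Vulcan (ARMv8) number
baseline = uop_cache_area_efficiency(uop_slots, l1i_insns, 1.0)
with_cracking = uop_cache_area_efficiency(uop_slots, l1i_insns, 1.15)

print(f"no expansion:    {baseline:.1%}")
print(f"1.15x expansion: {with_cracking:.1%}")
```

Any expansion factor above 1.0 pushes the result below 50%, which is the point of the paragraph above.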
Edit:
> Nothing is free in CPU design, it's just a massive balancing act.
Yup, and a large part of the x86 balancing act is to keep the x86 ISA alive and profit from the massive x86 ecosystem. Therefore Intel and AMD are prepared to sacrifice certain aspects, like power efficiency (and presumably performance too), and spend lots of energy on the x86 translation front end. That is a balancing act that designers of CPUs with more modern ISAs don't even have to consider.