> According to the label, that block contains both the uop cache AND the microcode ROM

Yes, so it's hard to tell the exact size. We can only conclude that the uOP cache and the microcode ROM combined are about twice the size of the L1I cache (in terms of memory cells).

Here is another core die shot of the Zen 2 microarchitecture (it appears to be correct, as it is based on official AMD slides): https://forums.anandtech.com/proxy.php?image=https%3A%2F%2Fa...

Here the uCode is in a separate area, and if we assume that the SRAM blocks in the area marked "Decode" represent the uOP cache, then we have:

* The uOP cache has the same physical size as the L1I cache
* uOP cache size = 4K uOPs
* L1I cache size = 32 KiB ~= 8K x86 instructions

If all this holds true (it's a big "if"), the number of uOP instructions that the uOP cache can hold is only half of the number of x86 instructions that the L1I cache can hold, and each uOP entry is in fact close to 32 KiB / 4K uOPs = 64 bits (given how similar the SRAM cells for the two caches look on the die shot, I assume that they have the same density).

Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked, and instructions with large immediates occupy more than one uOP slot; even the ARMv8 Vulcan microarchitecture sees a ~15% increase in instructions when cracking ARM instructions into uOPs: https://en.wikichip.org/wiki/cavium/microarchitectures/vulca... ), which would mean that the silicon area efficiency of the uOP cache compared to a regular L1I cache is even less than 50%.

Edit:

> Nothing is free in CPU design, it's just a massive balancing act.

Yup, and a large part of the x86 balancing act is to keep the x86 ISA alive and profit from the massive x86 ecosystem. Therefore Intel and AMD are prepared to sacrifice certain aspects, like power efficiency (and presumably performance too), and blow lots of energy on the x86 translation front end. That is a balancing act that designers of CPUs with more modern ISAs don't even have to consider.
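
For anyone who wants to check the back-of-envelope numbers in the comment above, here is a minimal sketch of the arithmetic. It rests on the same assumptions as the comment (equal SRAM area and density for both caches, 4K uOP slots, 32 KiB of L1I holding roughly 8K x86 instructions, i.e. about 4 bytes per instruction). The function names are made up for illustration, and the 1.15 expansion factor is simply the Vulcan figure borrowed as a stand-in, not a measured x86 value.

```python
KIB = 1024


def uop_entry_bits(cache_bytes: int, uop_slots: int) -> float:
    """Bits available per uOP slot if the uOP cache has `cache_bytes` of SRAM."""
    return cache_bytes * 8 / uop_slots


def relative_area_efficiency(l1i_instructions: int, uop_slots: int,
                             uops_per_instruction: float = 1.0) -> float:
    """x86 instructions held by the uOP cache, relative to an L1I of equal area."""
    return (uop_slots / uops_per_instruction) / l1i_instructions


# Assumptions taken from the comment, not from AMD documentation.
l1i_bytes = 32 * KIB            # L1I size; uOP cache assumed to occupy the same area
uop_slots = 4 * 1024            # 4K uOPs
avg_x86_instr_bytes = 4         # implied by "32 KiB ~= 8K x86 instructions"
l1i_instructions = l1i_bytes // avg_x86_instr_bytes

print(uop_entry_bits(l1i_bytes, uop_slots))                          # 64.0 bits per uOP
print(relative_area_efficiency(l1i_instructions, uop_slots))         # 0.5
# Illustrative only: applies Vulcan's ~15% expansion as if it held for x86.
print(round(relative_area_efficiency(l1i_instructions, uop_slots, 1.15), 3))  # 0.435
```

Changing `uops_per_instruction` shows how sensitive the "less than 50%" conclusion is to the cracking/fusion ratio, which is the number nobody outside AMD actually knows.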
I found annotated die shots of Zen 3 and Zen 4 that pretty much confirm the op cache: https://locuza.substack.com/p/zen-evolution-a-small-overview
Pretty strong evidence that AMD are using a much simpler encoding scheme with roughly 64 bits per uop. Also, that uop cache on Zen 4 is starting to look ridiculously large.
But that does give us a good idea of how big the microcode ROM is. If we go back to the previous Intel die shot with its combined microcode ROM + uop cache, it appears Intel's uop cache is actually quite small thanks to their better encoding.
> Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked
I suspect it's not massively higher than one uop per instruction. Remember, the uop cache is in the fused-uop domain (so before memory cracking), and instruction fusion can actually squash some instruction pairs into a single uop.
The bigger hindrance will be any rules that prevent every uop slot from being filled. Intel appears to have many such rules (at least for Sandy Bridge/Haswell/Skylake).
> and blow lots of energy on the x86 translation front end
TBH, we have no idea how big the x86 tax is. We can't just assume the difference in performance per watt between the average x86 design and average high performance aarch64 design is entirely caused by the x86 tax.
Intel and AMD simply aren't incentivised to optimise their designs for low power consumption, as their cores aren't used in mobile phones, where ultra-low power consumption is absolutely critical.