| HN Mirror


by phire 883 days ago

I think it's a bit strong to say that branch prediction "solves" the problem of x86's complex variable length encoding. But it certainly goes a long way to mitigating the handicap, and allowing x86 uarches to be very competitive with more RISC designs.

I suspect that original handicap is a large part of the reason why x86 managed to become a dominant ISA, beating out all of its RISC-derived competitors in the high-performance space. The requirement to continue supporting the legacy complex instruction encoding forced x86 uarch designers to go down the path of longer (but not too long) pipelines, powerful branch predictors and extremely large out-of-order windows.

It wasn't the obvious approach back in the 90s/2000s; high-performance RISC designs of that era tended to stick with in-order superscalar pipelines. And when they did explore out-of-order designs, they were much more restrained, with much smaller reorder buffers.

But in hindsight, it seems to have been the correct approach for high-performance microarchitectures. I can't help but notice that modern high-performance aarch64 cores from Apple and Arm have pipelines that look almost identical to the designs from AMD and Intel. The main difference is that they can get away with 8-wide instruction decoders instead of a uOP cache.

> which, by the way can not be as compact as compiler-generated RISC instructions - you typically need 64 bits per internal instruction or similar

Nah. According to Agner Fog's testing, Intel only allocates 32 bits per uop.

Immediates/addresses larger than a signed 16 bits are handled with various complex mechanisms. If a 32-bit/64-bit value is still inside the -2¹⁵ to +2¹⁵ range, it can be squashed down. Space can be borrowed from other uOPs in the same cacheline that don't need to store an immediate. Otherwise, the uOP takes up multiple slots in the uOP cache.
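
To make the "squashed down" case concrete, here is a rough sketch of the range check. It assumes the immediate slot is an ordinary two's-complement signed 16-bit field (so the upper bound is really 2¹⁵ - 1); the function names are made up for illustration and say nothing about how the hardware actually does it.

```python
# Sketch only: assumes a plain two's-complement signed 16-bit immediate slot.
# Function names are hypothetical; this is not Intel's actual uop format.

IMM_BITS = 16  # width of the immediate field described above


def fits_signed16(value: int) -> bool:
    """True if value is representable in a signed 16-bit field."""
    return -(1 << (IMM_BITS - 1)) <= value < (1 << (IMM_BITS - 1))


def squash(value: int):
    """Return the 16-bit field for value, or None if it needs extra space."""
    if not fits_signed16(value):
        return None  # would have to borrow space or take another slot
    return value & 0xFFFF


def unsquash(field: int) -> int:
    """Sign-extend a 16-bit field back to the full-width value."""
    field &= 0xFFFF
    return field - (1 << IMM_BITS) if field & 0x8000 else field
```

A value like -5 round-trips through `squash`/`unsquash`, while 100000 returns `None` and would need one of the other mechanisms.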

I suspect AMD also use a similar mechanism, because as you point out, caching un-encoded uOPs would be a huge waste of space. And there is no reason why you need to use the exact same encoding in both the pipeline and the uOP cache, it just needs to be significantly easier to decode than a full x86 instruction.

1 comments

mbitsnbites 883 days ago

I'll have to read up on Agner's findings.

My assumptions are largely based on annotated die shots, like this one of Rocket Lake (IIRC): https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

If they are correct, the uOP cache consumes at least as much silicon as the L1I cache, while they generally can hold fewer instructions.

Some napkin math: x86 instructions are 4 bytes long on average, so a 32 KiB L1I can hold 32/4 = 8K instructions, while the uOP cache can hold 4K uOPs (how many uOPs does an x86 instruction translate to on average?). That would indicate that uOPs require twice the silicon area to store compared to "raw" x86 instructions - or that the uOP cache is more advanced/complex than the L1I cache (which may very well be the case).
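
The same napkin math as a few lines of Python, just to keep the units straight. The 4-byte average and the 4K-entry uOP cache are the figures from this comment, not measurements.

```python
# Napkin math from the comment above; all inputs are the comment's estimates.
L1I_BYTES = 32 * 1024          # 32 KiB L1 instruction cache
AVG_X86_INSN_BYTES = 4         # assumed average x86 instruction length
UOP_CACHE_ENTRIES = 4 * 1024   # 4K uOPs

# How many average-sized x86 instructions fit in the L1I.
l1i_instructions = L1I_BYTES // AVG_X86_INSN_BYTES

# If both caches take the same area, this is the area cost of one uOP
# relative to one raw x86 instruction.
capacity_ratio = l1i_instructions / UOP_CACHE_ENTRIES

print(l1i_instructions, capacity_ratio)  # 8192 2.0
```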

Also visible from the die shots: decoding and branch prediction are far from free.


phire 882 days ago

According to the label, that block contains both the uop cache AND the microcode ROM (which is actually at least partially RAM to allow for microcode updates). I guess it makes sense to group the two functions together; they are both alternative sources of uOPs that aren't from the instruction decoder.

So it really depends on what the balance is. If it was two or three of those memory cell blocks, I agree it's quite big. But if it's just one, it's actually quite small.

Agner's findings are for the Sandy Bridge implementation. He says Haswell and Skylake share the same limitations, but it doesn't look like he has done much research into the later implementations.

The findings actually point to the uOP cache being much simpler in structure. The instruction cache has to support arbitrary instruction alignment and fetches that cross boundaries. The uOP cache has strict alignment requirements: it delivers one cache line per cycle and always delivers the entire line. If there aren't enough uops, then the rest of the cacheline is unused.
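
A tiny sketch of what "the rest of the cacheline is unused" costs in capacity. The default of 6 slots per line is an assumption for illustration (it is not stated in this thread), and `line_utilization` is a made-up helper.

```python
def line_utilization(uops_per_line, slots_per_line=6):
    """Fraction of uop-cache slots that actually hold a uop.

    uops_per_line: number of uops stored in each occupied line.
    slots_per_line: assumed line capacity (hypothetical default).
    """
    for n in uops_per_line:
        if not 0 <= n <= slots_per_line:
            raise ValueError(f"line cannot hold {n} uops")
    total_slots = len(uops_per_line) * slots_per_line
    if total_slots == 0:
        return 0.0
    return sum(uops_per_line) / total_slots


# Three lines holding 6, 3 and 3 uops fill 12 of 18 slots.
print(line_utilization([6, 3, 3]))
```

Since a whole line is delivered per cycle, every empty slot is both lost capacity and lost delivery bandwidth for that cycle.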

> Also visible from the die shots: decoding and branch prediction are far from free.

Yeah, it appears to be massive. And I get the impression that block is more branch prediction than decoding.

Nothing is free in CPU design, it's just a massive balancing act.


mbitsnbites 882 days ago

> According to the label, that block contains both the uop cache AND the microcode ROM

Yes, so it's hard to tell the exact size. We can only conclude that the uOP cache and the microcode ROM combined are about twice the size of the L1I cache (in terms of memory cells).

Another core die shot of the Zen 2 micro architecture is this (it appears to be correct as it is based on official AMD slides): https://forums.anandtech.com/proxy.php?image=https%3A%2F%2Fa...

Here uCode is in a separate area, and if we assume that the SRAM blocks in the area marked "Decode" represent the uOP cache, then we have:

* The uOP cache has the same physical size as the L1I cache

* uOP cache size = 4K uOPs

* L1I cache size = 32 KiB ~= 8K x86 instructions

If all this holds true (it's a big "if"), the number of uOPs that the uOP cache can hold is only half the number of x86 instructions that the L1I cache can hold, and the size of a uOP entry is in fact close to 32 KiB / 4K uOPs = 64 bits (given how similar the SRAM cells for the two caches are on the die shot, I assume that they have the same density).

Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked, and instructions with large immediates occupy more than one uOP slot - even the ARMv8 Vulcan microarchitecture sees a ~15% increase in instructions when cracking ARM instructions into uOPs: https://en.wikichip.org/wiki/cavium/microarchitectures/vulca... ), which would mean that the silicon area efficiency of the uOP cache compared to a regular L1I cache is even less than 50%.
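
Putting numbers on the last two paragraphs: a sketch that assumes equal SRAM density and equal area for the two caches (the big "if" above). The 1.15 expansion factor is borrowed from the Vulcan figure purely as a placeholder; it is not an x86 measurement.

```python
# Sketch under the assumptions above: same area, same SRAM density.
L1I_BYTES = 32 * 1024
UOP_CACHE_ENTRIES = 4 * 1024
AVG_X86_INSN_BYTES = 4  # assumed average x86 instruction length


def bits_per_uop(area_equivalent_bytes=L1I_BYTES, entries=UOP_CACHE_ENTRIES):
    """Bits available per uOP entry if the uOP cache matches the L1I in size."""
    return area_equivalent_bytes * 8 / entries


def area_efficiency(uops_per_insn):
    """x86 instructions held by the uOP cache relative to the L1I."""
    x86_in_uop_cache = UOP_CACHE_ENTRIES / uops_per_insn
    x86_in_l1i = L1I_BYTES / AVG_X86_INSN_BYTES
    return x86_in_uop_cache / x86_in_l1i


print(bits_per_uop())         # 64.0
print(area_efficiency(1.0))   # 0.5
print(area_efficiency(1.15))  # below 0.5 with the placeholder expansion
```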

Edit:

> Nothing is free in CPU design, it's just a massive balancing act.

Yup, and a large part of the x86 balancing act is to keep the x86 ISA alive and profit from the massive x86 ecosystem. Therefore Intel and AMD are prepared to sacrifice certain aspects, like power efficiency (and presumably performance too), and blow lots of energy on the x86 translation front end. That is a balancing act that designers of CPUs with more modern ISAs don't even have to consider.


phire 882 days ago

Yeah, that logic seems to all work out.

I found annotated die shots of Zen 3 and Zen 4 that pretty much confirm the op cache: https://locuza.substack.com/p/zen-evolution-a-small-overview

Pretty strong evidence that AMD are using a much simpler encoding scheme with roughly 64 bits per uop. Also, that uop cache on Zen 4 is starting to look ridiculously large.

But that does give us a good idea how big the microcode ROM is. If we go back to the previous Intel die shot with its combined microcode ROM + uop cache, it appears Intel's uop cache is actually quite small thanks to their better encoding.

> Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked

I suspect it's not massively higher than one uop per instruction. Remember, the uop cache is in the fused-uop domain (so before memory cracking) and instruction fusion can actually squash some instruction pairs into a single uop.

The bigger hindrance will be any rules that prevent every uop slot from being filled. Intel appears to have many such rules (at least for Sandy Bridge/Haswell/Skylake).

> and blow lots of energy on the x86 translation front end

TBH, we have no idea how big the x86 tax is. We can't just assume the difference in performance per watt between the average x86 design and average high performance aarch64 design is entirely caused by the x86 tax.

Intel and AMD simply aren't incentivised to optimise their designs for low power consumption as their cores simply aren't used in mobile phones where ultra low power consumption is absolutely critical.


mbitsnbites 882 days ago

> I found annotated die shots of Zen 3 and Zen 4

Ooo thanks! Sure looks like strong evidence.

> TBH, we have no idea how big the x86 tax is.

No, and it gets even more uncertain when you consider different design targets. E.g. a 1000W Threadripper targets a completely different segment than a 10W ARM Cortex. Would an ARM chip designed to run at 1000W beat the Threadripper? Who knows?

> Intel and AMD simply aren't incentivised to optimise their designs for low power consumption as their cores simply aren't used in mobile phones where ultra low power consumption is absolutely critical.

They'll keep doing their thing until they can't compete. They lost mobile and embedded, and competitors are eating into laptops and servers where x86 continues to have a stronghold. But perf/watt matters in all segments these days, and binary compatibility is dropping in importance (e.g. compared to 20-40 years ago), thanks largely to open source.

IMO the writing is on the wall, but it will take time (especially for the very slow server market).


phire 882 days ago

Yeah, I agree that the writing is on the wall for x86. As you said, power consumption does matter for laptops and server farms.

I'm a huge fan of aarch64, it's a very well designed ISA. I bought a Mac so I could get my hands on a high-performance aarch64 core. I love the battery life and I'm not going back.

I only really defend x86 because nobody else does, and then people dog pile on it talking it down and misrepresenting what a modern x86 pipeline is.

Though I wouldn't write x86 off yet. I get the impression that Intel are planning to eventually abandon their P core arch (the one with direct lineage all the way back to the original Pentium Pro). They haven't been doing much innovation on it.

Intel's innovation is actually focused on their E core arch, which started as the Intel Atom and wasn't even out-of-order. It's slowly evolved over the years with a continued emphasis on low power consumption until it's actually pretty competitive with the P core arch.

If you compare Golden Cove and Gracemont, the frontend is radically different. Golden Cove has a stupid 6-wide decoder that can deliver 8 uops per cycle... though it's apparently sitting idle 80% of the time (power gated) thanks to the 4K uop cache.

Gracemont doesn't have a uop cache. Instead it uses the space for a much larger instruction cache and two instruction decoders running in parallel, each 3-wide. It's a much more efficient way to get 6-wide instruction decoding bandwidth; I assume they are tagging decode boundaries in the instruction cache.

Gracemont is also stupidly wide. Golden Cove only has 12 execution unit ports, Gracemont has 17. It's a bit narrow in other places (only 5 uops per cycle between the front-end and backend) but give it a few more generations and a slight shift in focus and it could easily outperform the P core. Perhaps add a simple loop-stream buffer and upgrade to three or four of those 3-wide decoders running in parallel.

Such a design would have a significantly lower x86 tax. Low enough to save them in the laptop and server farm market? I have no idea. I'm just not writing them off.


mbitsnbites 882 days ago

BTW... Apart from the indications from the die shots, one of the reasons that I don't think that uOPs can be as small as 32 bits is that studying fixed-width ISAs and designing MRISC32 have made me appreciate the clever encoding tricks that go into fitting all instructions into 32 bits.

Many of the encoding tricks require compiler heuristics, and you don't want to do that in hardware. E.g. consider the AArch64 encoding of immediate values for bitwise operations.
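
For anyone who hasn't seen the AArch64 bitwise-immediate rule, here is a brute-force sketch of the validity check for 64-bit values: the value must be an element of 2, 4, 8, 16, 32 or 64 bits, consisting of a rotated run of ones, replicated across the register. The function name is made up, and this only tests encodability; it does not produce the actual instruction fields.

```python
MASK64 = (1 << 64) - 1


def is_logical_immediate64(value: int) -> bool:
    """Brute-force check: can value be an AArch64 bitwise-op immediate?"""
    value &= MASK64
    if value == 0 or value == MASK64:
        return False  # all-zeros and all-ones are not encodable
    for size in (2, 4, 8, 16, 32, 64):
        mask = (1 << size) - 1
        element = value & mask
        # The element must repeat across all 64 bits.
        replicated = 0
        for shift in range(0, 64, size):
            replicated |= element << shift
        if replicated != value:
            continue
        # Some rotation of the element must be a contiguous run of low ones.
        for rot in range(size):
            rotated = ((element >> rot) | (element << (size - rot))) & mask
            if (rotated & (rotated + 1)) == 0:
                return True
    return False


print(is_logical_immediate64(0x00FF00FF00FF00FF))  # True
print(is_logical_immediate64(0x1234))              # False
```

Even the check is a search over element sizes and rotations, which is the kind of work that is cheap in an assembler or compiler and awkward to do on the fly in a decoder.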

Also, even if you manage to do efficient instruction encoding in hardware, you will probably end up in a situation where you need to add an advanced decoder after the uOP cache, which does not make much sense.

The main thing that x86 has going for it in this regard is that most instructions use destructive operands, which probably saves a bunch of bits in the uOP encoding space. But still, it would make much more sense to use more than 32 bits per uOP.


phire 882 days ago

> designing MRISC32 have made me appreciate the clever encoding tricks that go into fitting all instructions into 32 bits.

Keep in mind that the average RISC ISA uses 5-bit register IDs and uses three-arg form for most instructions, that's 15 bits gone. While AMD64 uses 4-bit register IDs and uses two-arg form for most instructions, which is only 8 bits.

Also, the encoding scheme that Agner describes is not a fixed-width encoding. It's variable width with 16-bit, 32-bit, 48-bit and 64-bit uops. There are also some (hopefully rare) uops which don't fit in the uop cache's encoding scheme (forcing a fallback to the instruction decoders). Those two relief valves allow such an encoding to avoid the need for the complex encoding tricks of a proper fixed-width encoding.
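
The bit budget spelled out, as a rough sketch. It only counts register fields in a hypothetical 32-bit slot and ignores everything else a real uop has to carry (opcode, operand size, memory addressing, flags and so on).

```python
SLOT_BITS = 32  # hypothetical uop/instruction width for the comparison


def register_field_bits(operands: int, bits_per_register: int) -> int:
    """Bits spent on register IDs alone."""
    return operands * bits_per_register


risc_bits = register_field_bits(operands=3, bits_per_register=5)   # three-arg, 32 regs
amd64_bits = register_field_bits(operands=2, bits_per_register=4)  # two-arg, 16 regs

print(risc_bits, SLOT_BITS - risc_bits)    # 15 17
print(amd64_bits, SLOT_BITS - amd64_bits)  # 8 24
```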

So I find the scheme to be plausible, though what you say about decoders after the uOP cache is true.
