Hacker News new | ask | show | jobs
by uis 878 days ago
Read EDIT first.

But why not just do branch delay? Or some sort of prepared branches like on ELBRUS where you can write to predicate register many instrucions ago and take conditional branch at same cost of non-conditional or maybe even prefetch intstructions into second copy of front-end like E2K does. Or some sort of explicit branch slots.

Anyway, I registered here just to reply to you. I kinda want to talk to you more because you do what is interesting for me too.

EDIT: I see you already store "predicates" in registers.

2 comments

My understanding: Delay slots only make sense when you don't have a branch predictor.

I get the impression that many people massively underestimate how powerful branch prediction is. Even a simple 2 bit saturating counter will correctly predict about 90% of branches, and the accuracy only goes up with better designs.

So why optimise for the uncommon case of incorrectly predicted branches? In the worst case with static branch delay slots, it actually harms branch prediction, because instead of executing the correctly predicted instructions, it's executing the delay slot, which is often a nop because the compiler couldn't find something to put there.

With your other ideas (multiple decoders, explicit delay slots) it's just a question about if it's a good use of resources (design time, transistors, compiler support) to support this uncommon case, or if you might be better off optimising something else like the branch predictor so more code goes down the common path, or just improving the pipeline's throughput in general.

> Delay slots only make sense when you don't have a branch predictor.

Agreed.

I'll go even further: lots of other RISC ideas only really matter when you don't have a branch predictor (and your transistor and design budgets are tiny).

Single-length instructions, for example. Even x86 instructions aren't much of a problem when you can afford to throw more pipeline stages at the problem -- which you can with a branch predictor.

I don't think anybody in the mid-80's understood how good branch predictors can be. Most people probably still didn't really understand it in the early 90's.

> Single-length instructions, for example. Even x86 instructions aren't much of a problem when you can afford to throw more pipeline stages at the problem

I think that the problem is bigger than that. Sure, the branch predictor usually keeps even the x86 pipeline busy, but x86 variable length instruction encoding becomes a problem for at least three problems:

1. Decoding width - There is a practical limit to how many instructions you can decode in parallell. You can add pipeline steps, but at some point it becomes absurd.

2. You get a very big range of minimum-to-maximum number of bytes that you need to fetch to extract a fixed number of instructions each clock. E.g. an eight-wide front-end would have to fetch 8-120 bytes per clock cycle. And it gets worse since your next eight instructions may not start on a nice power-of-two boundary, so you have to fetch much more, e.g. 256 bytes per cycle, in order to cover the worst case scenario (compared 32 bytes per cycle for a fixed width 32-bit RISC encoding). And you may gate cache line / page crossings in your "bundle".

3. Since you need to do fairly heavy translation work into an internal RISC-like encoding (which, by the way can not be as compact as compiler-generated RISC instructions - you typically need 64 bits per internal instruction or similar), you need to cache your translations into a uOP cache (or L0 cache). This cache uses much more silicon per effective instruction than a regular L1I cache, so it can not hold as many instructions (and I'm pretty sure that most of the instructions in the L0 cache are stored in the L1 cache too - so not really extra cache memory). All this silicon could be used for a larger L1I cache, for instance (or a better branch predictor).

So, yes, branch prediction really helps, but it does not solve all problems.

> 1. Decoding width - There is a practical limit to how many instructions you can decode in parallell. You can add pipeline steps, but at some point it becomes absurd.

Branches. Branches also make really wide decodes useless. The cost/benefit is towards wider decoders for A64 than for AMD64. The average A64 instruction does slightly less work than the average AMD64 instruction so the net result is that it makes sense to have slightly wider decoders (in terms of "work") for A64 than for AMD64.

X86 CPUs don't quite use a "RISC-like encoding". The µops support RMW for memory, for example. The encoding is of course very much regularized, but I don't think the RISC people have a patent on that.

Translation to an internal format is common for high-performance RISC CPUs as well. The Power CPUs call it "cracking" when complicated instructions are split into simpler µops.

> The average A64 instruction does slightly less work than the average AMD64 instruction so the net result is that it makes sense to have slightly wider decoders

I'm not sure that the difference is that big. A64 actually has quite powerful instructions, and some of them do more work than similar x86 instructions (madd and ubfx come to mind). In my testing A64 code often has fewer instructions than x86: https://www.bitsnbites.eu/cisc-vs-risc-code-density/

> X86 CPUs don't quite use a "RISC-like encoding". The µops support RMW for memory, for example.

I would love to learn more about that. Do you have any references? I was under the impression that internal instructions followed the load/store principle since I assume that the internal pipeline is a load/store pipeline?

> The Power CPUs call it "cracking" when complicated instructions are split into simpler µops.

Yes, it's the IBM term AFAIK. They call it cracking in zArch too. I also suapect that at least some ARMv8/9 implementations do cracking too (many AArch64 instructions have multiple results, which might be better handled as multiple internal instructions - I think it's partly a code density thing).

> I was under the impression that internal instructions followed the load/store principle since I assume that the internal pipeline is a load/store pipeline?

Well... peterfirefly is making a very generalised statement that isn't really true.

As far as I'm aware, no out-of-order Intel processor can do a full read-modify-write in a single uOP. And if you go all the way back to the original P6 pipeline (Pentium Pro, Pentium II, Pentium III), it does appear to be a proper load-store arch. RMW instructions generate at least 4 uOPs

But the Pentium M and later can do a read + modify to register in a single fused uOP, and a RMW in just two fused uOPs. Fused uops kind of muddle the issue: they might issue to two or more execution units, but for the purposes of scheduling, they only take up a single slot.

So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it.

-----

And then there is AMD. Zen and later are more or less identical to Intel's modern fused uOP scheme.

But their older cores had much more capable internal encoding which AMD called "macro-ops". And those macro-ops could do a full read-modify-write operation with a single op. Unlike Intel and the later Zen core, each integer execution unit needed to have both ALUs and AGUs, along with read/write ports to the data cache.

> I would love to learn more about that. Do you have any references?

Agner Fog is the best resource for this type of thing.

https://www.agner.org/optimize/

A combination of microarchitecture.pdf for details about the various pipelines and instruction_tables.pdf for what uops the various instructions breakdown into on the various pipelines.

I think it's a bit strong to say that branch prediction "solves" the problem of x86's complex variable length encoding. But it certainly goes a long way to mitigating the handicap, and allowing x86 uarches to be very competitive with more RISC designs.

I suspect that original handicap is a large part of the reason why x86 managed to become a dominant ISA, beating out all of its RISC derived competitors in the high-performance space. The requirement to continue supporting the legacy complex instruction encoding forced x86 uarch designers to go down the path of longer (but not too long) pipelines, powerful branch predictors and extremely large out-of-order windows.

It wasn't the obvious approach back in the 90s/2000s, high-performance RISC designs of that era tended to stick with in-order superscalar pipelines. And when they did explore out-of-order designs, they were much more restrained with way smaller reorder-buffers.

But in hindsight, it seems to have been the correct approach for high-performance micro arches. I can't help but notice that modern high-performance aarch64 cores from Apple and Arm have pipelines that look almost identical to the designs from AMD and Intel. Main difference is that they can get away with 8-wide instruction decoders instead of a uOP caches.

> which, by the way can not be as compact as compiler-generated RISC instructions - you typically need 64 bits per internal instruction or similar

Nah. According to Agner Fog's testing, Intel only allocates 32bits per uop.

Immediates/Addresses larger than signed 16bits are handled with various complex mechanisms. If a 32bit/64bit value is still inside the -2¹⁵ to +2¹⁵ range, it can be squashed down. Space can be borrowed from other uOPs in the same cacheline that don't need to store an immediate. Otherwise, the uOP takes up multiple slots in the uOP cache.

I suspect AMD also use a similar mechanism, because as you point out, caching un-encoded uOPs would be a huge waste of space. And there is no reason why you need to use the exact same encoding in both the pipeline and the uOP cache, it just needs to be significantly easier to decode than a full x86 instruction.

I'll have to read up on Agner's findings.

My assumptions are largely based on annotated die shots, like this one of Rocket Lake (IIRC): https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

If they are correct, the uOP cache consumes at least as much silicon as the L1I cache, while they generally can hold fewer instructions.

Some napkin math: x86 instructions are 4 bytes long on average, so a 32KiB L1I can hold 32/4=8K instructions, while the uOP cache can hold 4K uOP instructions (how many uOPs does an x86 instruction translate to on average?). That would indicate that uOP:s require twice the silicon area to store compared to "raw" x86 instructions - or that the uOP cache is more advanced/complex than the L1I cache (which may very well be the case).

Also visible from the die shots: decoding and branch prediction are far from free.

According to the label, that block contains both the uop cache AND the microcode ROM (which is actually at least partially RAM to allow for microcode updates). I guess it makes sense to group the two functions together, they are both alternative sources of uOPs that aren't from the instruction decoder.

So really depends on what the balance is. If it was two or three of those memory cell blocks, I agree it's quite big. But if it's just one, it's actually quite small.

Agner's findings are for the Sandybridge implementation. He says Haswell and Skylake share the same limitations, but doesn't look like he has done much research into the later implementations.

The findings actually point to the uOP cache being much simpler in structure. The instruction cache has to support arbitrary instruction alignment and fetches that cross boundaries. The uOP cache has strict alignment requirements, it delivers one cache line per cycle and always delivers the entire line. If there aren't enough uops, then the rest of the cacheline is unused.

> Also visible from the die shots: decoding and branch prediction are far from free.

Yeah, it appears to be massive. And I get the impression that block is more branch prediction than decoding.

Nothing is free in CPU design, it's just a massive balancing act.

He hangs out in the comp.arch newsgroup.