| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by peterfirefly 881 days ago

> 1. Decoding width - There is a practical limit to how many instructions you can decode in parallell. You can add pipeline steps, but at some point it becomes absurd.

Branches. Branches also make really wide decodes useless. The cost/benefit is towards wider decoders for A64 than for AMD64. The average A64 instruction does slightly less work than the average AMD64 instruction so the net result is that it makes sense to have slightly wider decoders (in terms of "work") for A64 than for AMD64.

X86 CPUs don't quite use a "RISC-like encoding". The µops support RMW for memory, for example. The encoding is of course very much regularized, but I don't think the RISC people have a patent on that.

Translation to an internal format is common for high-performance RISC CPUs as well. The Power CPUs call it "cracking" when complicated instructions are split into simpler µops.

1 comments

mbitsnbites 881 days ago

> The average A64 instruction does slightly less work than the average AMD64 instruction so the net result is that it makes sense to have slightly wider decoders

I'm not sure that the difference is that big. A64 actually has quite powerful instructions, and some of them do more work than similar x86 instructions (madd and ubfx come to mind). In my testing A64 code often has fewer instructions than x86: https://www.bitsnbites.eu/cisc-vs-risc-code-density/

> X86 CPUs don't quite use a "RISC-like encoding". The µops support RMW for memory, for example.

I would love to learn more about that. Do you have any references? I was under the impression that internal instructions followed the load/store principle since I assume that the internal pipeline is a load/store pipeline?

> The Power CPUs call it "cracking" when complicated instructions are split into simpler µops.

Yes, it's the IBM term AFAIK. They call it cracking in zArch too. I also suapect that at least some ARMv8/9 implementations do cracking too (many AArch64 instructions have multiple results, which might be better handled as multiple internal instructions - I think it's partly a code density thing).

link

phire 881 days ago

> I was under the impression that internal instructions followed the load/store principle since I assume that the internal pipeline is a load/store pipeline?

Well... peterfirefly is making a very generalised statement that isn't really true.

As far as I'm aware, no out-of-order Intel processor can do a full read-modify-write in a single uOP. And if you go all the way back to the original P6 pipeline (Pentium Pro, Pentium II, Pentium III), it does appear to be a proper load-store arch. RMW instructions generate at least 4 uOPs

But the Pentium M and later can do a read + modify to register in a single fused uOP, and a RMW in just two fused uOPs. Fused uops kind of muddle the issue: they might issue to two or more execution units, but for the purposes of scheduling, they only take up a single slot.

So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it.

-----

And then there is AMD. Zen and later are more or less identical to Intel's modern fused uOP scheme.

But their older cores had much more capable internal encoding which AMD called "macro-ops". And those macro-ops could do a full read-modify-write operation with a single op. Unlike Intel and the later Zen core, each integer execution unit needed to have both ALUs and AGUs, along with read/write ports to the data cache.

> I would love to learn more about that. Do you have any references?

Agner Fog is the best resource for this type of thing.

https://www.agner.org/optimize/

A combination of microarchitecture.pdf for details about the various pipelines and instruction_tables.pdf for what uops the various instructions breakdown into on the various pipelines.

link

mbitsnbites 881 days ago

Thanks, I have read the Agner documents before. I will dig around some more and get updated.

Anyway, I found this, regarding RMW (for Ice/Tiger Lake):

> Most instructions with a memory operand are split into multiple μops at the allocation stage. Read-modify instructions, such as add eax,[rbx], are split into two μops, one for address calculation and memory read, and one for the addition. Read-modify-write instructions, such as add [rbx],eax, are split into four μops.

I read it as a instructions that use memory operands (other than simple mov instructions) are usually split into at least two uOPs, which makes perfect sense for a load/store pipeline.

> So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it.

The way I see it, modern x86 microarchitectures are load/store. My definition of load/store is that all instructions/operations that flow through the execution part of the pipeline can either load/store data OR perform operations on registers, not both (except possibly edge cases like calculating an address or writing back an updated address to a register).

That is by far the most efficient way to implement a CPU pipeline: You don't want to read data in one pipeline stage, use the data in an ALU in a later stage, and possibly write data in an even later stage. That would drastically increase instruction latency and/or require duplication of resources.

This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs is the first place: translate x86 instructions into uOPs (multiple ones for instructions that access memory) so that the pipeline can be implemented as a traditional load/store pipeline.

In a way the x86 front end is similar to software binary translation (a'la Transmeta, NVIDIA Denver or Apple Rosetta 2). It's fairly complex, and the prime objective is to take code for a legacy ISA and transform it into something that can run in a pipeline that the ISA was originally not intended to run in. By doing the translation in hardware you avoid the latencies inherent to software translation (JIT or AOT), but the costs are unavoidable (particularly silicon area and power consumption).

link

phire 880 days ago

It's only a load/store architecture if you consider the "unfused-uop" to be the native internal ISA of the underlying pipeline.

But that seems to be an incorrect perspective. The pipeline's native internal ISA appears to very much be the "fused-uop". It's the only metric which matters for most of the pipeline, decode limits are specified in fused-uops, the uop cache stores fused uops, the ROB only uses a single entry for fused-uop. The only part of the pipeline were that deals with unfused-uops is the scheduler and the execution units themselves. Even the retire stage works on fused-uops.

It's probably better to think of the pipeline's native ISA as an instruction that can sometimes be scheduled to two execution units. It's almost a very basic VLIW arch, if you ignore the dynamic scheduling.

Sure, the execution units are load/store. And the scheduling is load/store. But I don't think that's enough to label the entire pipeline as load/store since absolutely every other part of the pipeline uses fused-uops and is therefore not load/store.

> This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs is the first place: translate x86 instructions into uOPs so that the pipeline can be implemented as a traditional load/store pipeline.

I'm really not a fan of the "translate" terminology being used to describe modern x86 pipelines. It's not quite wrong, but it does seem to mislead people (especially RISC fans) into overstating the nature of the transformation.

It's nothing like software binary translation (especially something like Rosetta 2), the transforms are far simpler. It's not like Intel took an off-the-shelf RISC architecture and limited their efforts to just designing a translation frontend for it (the few examples of direct hardware translation, like NVIDIA Denver and Itanium have pretty horrible performance in that mode).

No, they designed and evolved the pipeline and its internal ISA to directly match the x86 ISA they needed to run.

All the front end is really doing is regularising the encoding to something sane and splitting up some of the more complex legacy instructions. Instructions with memory operands are converted to a single fused-uop. The front-end only splits the read-modify-write instructions into two fused-uops. The transform into proper load/store form doesn't happen until much further down the pipeline as the fused-uop gets inserted into the scheduler.

I have quite a bit of experience writing software binary translation software, and I ensure you such translations are significantly more complex than the transforms you find inside an x86 pipeline.

> Thanks, I have read the Agner documents before. I will dig around some more and get updated.

I swear every single time I read them (or usually just parts of them) I learn more about x86 microarches (and CPU design in general). It's not something that can be absorbed in a single pass.

link

mbitsnbites 880 days ago

Good points. I guess it's a debate of nomenclature, which lacks real importance (although it helps to reduce confusion).

My point of view is mostly that, no, the x86 architecture certainly is not load-store, but internally modern x86 machines have execution pipelines that are built like regular load-store pipelines (i.e. the topology and flow is mostly the same as a RISC-style load-store pipeline).

Or to put it another way, x86 execution pipelines are much closer to being register-register than being register-memory.

> No, they designed and evolved the pipeline and its internal ISA to directly match the x86 ISA they needed to run.

Yes. That is very true. Although the front end is the part of the pipeline that is most x86-specific, there are many parts of the rest of the pipeline that is tailored to be optimal for x86 code. It was obviously not designed in a vacuum.

An interesting observation is that even other ISA:s and microarchitectures have been influenced by x86 (e.g. by including similar flags registers in the architectural state), in order to not suck at emulation of x86 code.

link

phire 880 days ago

> I guess it's a debate of nomenclature, which lacks real importance (although it helps to reduce confusion).

Agreed. I wouldn't be pedantic about this for a higher level discussion, but when you start to get into the weeds of what a modern x86 pipeline is, it's starts to matter.

> (i.e. the topology and flow is mostly the same as a RISC-style load-store pipeline).

So that brings up an interesting question. What is a RISC-style pipeline?

Everyone mostly agrees what a RISC-style ISA is, usually focusing on the load/store arch and fixed length instructions (and exposed pipeline implementation details was another common attribute of early RISC ISAs, but everyone now agrees that was a bad idea).

But nobody ever seems to define "RISC-style pipeline", or they allude to overly broad definitions like "Everything that's not CISC" or "any pipeline for a RISC-style ISA". But you might have noticed that I like more precision in my nomenclature. And modern x86 CPUs prove the point that just because something is executing a CISC like ISA doesn't mean the pipeline is also CISC. There is nothing stopping someone implementing a CISC style uarch for a RISC style ISA (other than the fact it's a stupid idea).

I like to make a the bold argument the definition of "RISC-style pipeline" should be limited to simple in-order pipelines that are optimised for high-throughput that approaches one instruction per cycle (or group of instructions for super-scalar designs). The 5-stage MIPS pipeline is probably the most classic example of a RISC pipeline and the one usually taught in cpu architecture courses. But it ranges from 3 stage pipeline of Berkley RISC and early ARM chips to... well I think we have to include the PowerPC core from the PowerPC and Xbox 360 in this definition, and that has something like 50 stages.

(BTW, I also like to exclude VLIW arches from this definition of a RISC-style pipeline, sorry to everyone like Itanium and Transmeta)

Your 8 stage MRISC32 pipeline is a great sample of a modern RISC-style pipeline and along with all the in-order cores from ARM that are often around the same length.

But this does mean I'm arguing that anything with out-of-order execution is not RISC. Maybe you could argue that some simpler out-of-order schemes (like the early powerpc cores) aren't that far from RISC pipelines because they only have very small OoO windows. But by the time you get to the modern high-preformance paradigm of many instruction decoders (or uop caches) and absolutely massive re-order buffers, unified physical register file and many specialised execution units.

It's very much a different architecture to the original RISC-style pipelines, even if they still implement a RISC style ISA. It's an architecture that we don't have a name for, but I think we should. Intel pioneered this architecture starting with the P6 pipeline and perfecting it around the time of Sandybridge, and then everyone else seems to copying it for their high-performance cores.

I'm going to call this arch the "massively out-of-order" for now.

------------

Anyway, I wanted to switch to this way more precise and pedantic definition of "RISC-style pipeline" so that I could point out a key difference in the motivation in why Intel adopted this load-store aspect to their pipeline compared to RISC-style pipeline.

RISC-style pipelines are load-store partly because it results in a much more compact instruction encoding and allows fixed-length instruction, but mostly because it allows for a much simpler and more compact pipeline. The memory stage of the pipeline is typically after the execute stage. If a RISC-style pipeline needed to implement register-memory operations, they would need to move the execute stage after the memory stage completes, resulting in a significantly longer pipeline. And very problematic for the most pure RISC-style pipelines that don't have branch predictors and are relying on short their pipelines and branch-delay slots for branch performance.

A massively out-of-order x86 pipeline gets neither advantage. The instruction encoding is still register-memory (plus those insane read-modify-write instructions), and the cracking the instructions into multiple uops causes extra pipeline complexity. Also, they have good branch predictors.

The primary motivation for cracking those instructions is actually memory latency. Those pipelines want to be able to execute the memory uop as soon as possible so that if they miss in the L1 cache, the latency can be hidden.

This memory latency hiding advantage is also one of the major reasons why modern high-performance RISC cores moved away from the RISC-style pipeline to adopt the same massively out-of-order pipelines as x86. They just have a large advantage in the fact that their ISA is already load/store.

link