| HN Mirror


by phire 874 days ago

> I was under the impression that internal instructions followed the load/store principle since I assume that the internal pipeline is a load/store pipeline?

Well... peterfirefly is making a very generalised statement that isn't really true.

As far as I'm aware, no out-of-order Intel processor can do a full read-modify-write in a single uOP. And if you go all the way back to the original P6 pipeline (Pentium Pro, Pentium II, Pentium III), it does appear to be a proper load-store arch. RMW instructions generate at least 4 uOPs.

But the Pentium M and later can do a read + modify to register in a single fused uOP, and a RMW in just two fused uOPs. Fused uops kind of muddle the issue: they might issue to two or more execution units, but for the purposes of scheduling, they only take up a single slot.
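To make the fused/unfused distinction concrete, here is a minimal Python sketch. The uop counts are only the ones quoted in this thread for Pentium M and later cores; the `Cracking` class, the table name, and the breakdown of the four RMW uops given in the comments are illustrative assumptions, not a description of any specific core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cracking:
    """Hypothetical per-instruction uop counts (not real hardware data)."""
    fused: int    # slots used where the pipeline tracks fused uops
    unfused: int  # uops actually sent to execution units


# Assumed table, built from the numbers mentioned in this thread.
PENTIUM_M_STYLE = {
    "reg_reg": Cracking(fused=1, unfused=1),            # add eax, ebx
    "read_modify": Cracking(fused=1, unfused=2),        # add eax, [rbx]: load + add
    # add [rbx], eax: assumed split into load, add, store-address, store-data
    "read_modify_write": Cracking(fused=2, unfused=4),
}


def count_uops(program, table=PENTIUM_M_STYLE):
    """Return (fused, unfused) totals for a list of instruction kinds."""
    fused = sum(table[kind].fused for kind in program)
    unfused = sum(table[kind].unfused for kind in program)
    return fused, unfused


mix = ["read_modify", "read_modify_write", "reg_reg"]
print(count_uops(mix))  # (4, 7)
```

For this three-instruction mix, the parts of the pipeline that count fused uops would see 4 entries, while the execution units would see 7 operations.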

So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it.

-----

And then there is AMD. Zen and later are more or less identical to Intel's modern fused uOP scheme.

But their older cores had much more capable internal encoding which AMD called "macro-ops". And those macro-ops could do a full read-modify-write operation with a single op. Unlike Intel and the later Zen core, each integer execution unit needed to have both ALUs and AGUs, along with read/write ports to the data cache.

> I would love to learn more about that. Do you have any references?

Agner Fog is the best resource for this type of thing.

https://www.agner.org/optimize/

A combination of microarchitecture.pdf for details about the various pipelines and instruction_tables.pdf for what uops the various instructions break down into on the various pipelines.

1 comments

mbitsnbites 874 days ago

Thanks, I have read the Agner documents before. I will dig around some more and get updated.

Anyway, I found this, regarding RMW (for Ice/Tiger Lake):

> Most instructions with a memory operand are split into multiple μops at the allocation stage. Read-modify instructions, such as add eax,[rbx], are split into two μops, one for address calculation and memory read, and one for the addition. Read-modify-write instructions, such as add [rbx],eax, are split into four μops.

I read it as: instructions that use memory operands (other than simple mov instructions) are usually split into at least two uOPs, which makes perfect sense for a load/store pipeline.

> So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it.

The way I see it, modern x86 microarchitectures are load/store. My definition of load/store is that all instructions/operations that flow through the execution part of the pipeline can either load/store data OR perform operations on registers, not both (except possibly edge cases like calculating an address or writing back an updated address to a register).
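That definition can be written down as a small predicate. This is only a sketch of the rule as stated above: the `Uop` type and its two flags are hypothetical, and the address-calculation edge case is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    """Hypothetical operation flowing through the execution pipeline."""
    text: str
    touches_memory: bool  # loads or stores data
    uses_alu: bool        # computes on register values


def is_load_store(ops):
    """True if no single operation both accesses memory and does ALU work."""
    return all(not (op.touches_memory and op.uses_alu) for op in ops)


# add eax, [rbx] viewed as two separate operations...
cracked = [
    Uop("load tmp, [rbx]", touches_memory=True, uses_alu=False),
    Uop("add eax, tmp", touches_memory=False, uses_alu=True),
]
# ...and viewed as one indivisible operation.
uncracked = [Uop("add eax, [rbx]", touches_memory=True, uses_alu=True)]

print(is_load_store(cracked))    # True
print(is_load_store(uncracked))  # False
```

Whether a given pipeline passes the test depends entirely on which of the two views you treat as the unit that "flows through" it.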

That is by far the most efficient way to implement a CPU pipeline: You don't want to read data in one pipeline stage, use the data in an ALU in a later stage, and possibly write data in an even later stage. That would drastically increase instruction latency and/or require duplication of resources.

This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs in the first place: translate x86 instructions into uOPs (multiple ones for instructions that access memory) so that the pipeline can be implemented as a traditional load/store pipeline.

In a way the x86 front end is similar to software binary translation (à la Transmeta, NVIDIA Denver or Apple Rosetta 2). It's fairly complex, and the prime objective is to take code for a legacy ISA and transform it into something that can run in a pipeline that the ISA was originally not intended to run in. By doing the translation in hardware you avoid the latencies inherent to software translation (JIT or AOT), but the costs are unavoidable (particularly silicon area and power consumption).


phire 874 days ago

It's only a load/store architecture if you consider the "unfused-uop" to be the native internal ISA of the underlying pipeline.

But that seems to be an incorrect perspective. The pipeline's native internal ISA appears to very much be the "fused-uop". It's the only metric that matters for most of the pipeline: decode limits are specified in fused-uops, the uop cache stores fused-uops, and the ROB only uses a single entry per fused-uop. The only parts of the pipeline that deal with unfused-uops are the scheduler and the execution units themselves. Even the retire stage works on fused-uops.

It's probably better to think of the pipeline's native ISA as an instruction that can sometimes be scheduled to two execution units. It's almost a very basic VLIW arch, if you ignore the dynamic scheduling.

Sure, the execution units are load/store. And the scheduling is load/store. But I don't think that's enough to label the entire pipeline as load/store since absolutely every other part of the pipeline uses fused-uops and is therefore not load/store.

> This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs in the first place: translate x86 instructions into uOPs so that the pipeline can be implemented as a traditional load/store pipeline.

I'm really not a fan of the "translate" terminology being used to describe modern x86 pipelines. It's not quite wrong, but it does seem to mislead people (especially RISC fans) into overstating the nature of the transformation.

It's nothing like software binary translation (especially something like Rosetta 2); the transforms are far simpler. It's not like Intel took an off-the-shelf RISC architecture and limited their efforts to just designing a translation frontend for it (the few examples of direct hardware translation, like NVIDIA Denver and Itanium, have pretty horrible performance in that mode).

No, they designed and evolved the pipeline and its internal ISA to directly match the x86 ISA they needed to run.

All the front end is really doing is regularising the encoding to something sane and splitting up some of the more complex legacy instructions. Instructions with memory operands are converted to a single fused-uop. The front-end only splits the read-modify-write instructions into two fused-uops. The transform into proper load/store form doesn't happen until much further down the pipeline as the fused-uop gets inserted into the scheduler.

I have quite a bit of experience writing binary translation software, and I assure you such translations are significantly more complex than the transforms you find inside an x86 pipeline.

> Thanks, I have read the Agner documents before. I will dig around some more and get updated.

I swear every single time I read them (or usually just parts of them) I learn more about x86 microarches (and CPU design in general). It's not something that can be absorbed in a single pass.


mbitsnbites 874 days ago

Good points. I guess it's a debate of nomenclature, which lacks real importance (although it helps to reduce confusion).

My point of view is mostly that, no, the x86 architecture certainly is not load-store, but internally modern x86 machines have execution pipelines that are built like regular load-store pipelines (i.e. the topology and flow is mostly the same as a RISC-style load-store pipeline).

Or to put it another way, x86 execution pipelines are much closer to being register-register than being register-memory.

> No, they designed and evolved the pipeline and its internal ISA to directly match the x86 ISA they needed to run.

Yes. That is very true. Although the front end is the part of the pipeline that is most x86-specific, there are many parts of the rest of the pipeline that are tailored to be optimal for x86 code. It was obviously not designed in a vacuum.

An interesting observation is that even other ISA:s and microarchitectures have been influenced by x86 (e.g. by including similar flags registers in the architectural state), in order to not suck at emulation of x86 code.


phire 873 days ago

> I guess it's a debate of nomenclature, which lacks real importance (although it helps to reduce confusion).

Agreed. I wouldn't be pedantic about this for a higher level discussion, but when you start to get into the weeds of what a modern x86 pipeline is, it starts to matter.

> (i.e. the topology and flow is mostly the same as a RISC-style load-store pipeline).

So that brings up an interesting question. What is a RISC-style pipeline?

Everyone mostly agrees what a RISC-style ISA is, usually focusing on the load/store arch and fixed length instructions (and exposed pipeline implementation details was another common attribute of early RISC ISAs, but everyone now agrees that was a bad idea).

But nobody ever seems to define "RISC-style pipeline", or they allude to overly broad definitions like "Everything that's not CISC" or "any pipeline for a RISC-style ISA". But you might have noticed that I like more precision in my nomenclature. And modern x86 CPUs prove the point that just because something is executing a CISC like ISA doesn't mean the pipeline is also CISC. There is nothing stopping someone implementing a CISC style uarch for a RISC style ISA (other than the fact it's a stupid idea).

I'd like to make the bold argument that the definition of "RISC-style pipeline" should be limited to simple in-order pipelines that are optimised for high throughput approaching one instruction per cycle (or one group of instructions per cycle for superscalar designs). The 5-stage MIPS pipeline is probably the most classic example of a RISC pipeline and the one usually taught in CPU architecture courses. But it ranges from the 3-stage pipelines of Berkeley RISC and early ARM chips to... well, I think we have to include the PowerPC core from the PlayStation 3 and Xbox 360 in this definition, and that has something like 50 stages.

(BTW, I also like to exclude VLIW arches from this definition of a RISC-style pipeline, sorry to everyone like Itanium and Transmeta)

Your 8 stage MRISC32 pipeline is a great example of a modern RISC-style pipeline, along with all the in-order cores from ARM, which are often around the same length.

But this does mean I'm arguing that anything with out-of-order execution is not RISC. Maybe you could argue that some simpler out-of-order schemes (like the early PowerPC cores) aren't that far from RISC pipelines because they only have very small OoO windows. But by the time you get to the modern high-performance paradigm of many instruction decoders (or uop caches), absolutely massive re-order buffers, a unified physical register file and many specialised execution units, it's very much a different architecture to the original RISC-style pipelines, even if it still implements a RISC style ISA.

It's an architecture that we don't have a name for, but I think we should. Intel pioneered this architecture starting with the P6 pipeline and perfected it around the time of Sandy Bridge, and now everyone else seems to be copying it for their high-performance cores.

I'm going to call this arch the "massively out-of-order" for now.

------------

Anyway, I wanted to switch to this way more precise and pedantic definition of "RISC-style pipeline" so that I could point out a key difference in the motivation in why Intel adopted this load-store aspect to their pipeline compared to RISC-style pipeline.

RISC-style pipelines are load-store partly because it results in a much more compact instruction encoding and allows fixed-length instructions, but mostly because it allows for a much simpler and more compact pipeline. The memory stage of the pipeline is typically after the execute stage. If a RISC-style pipeline needed to implement register-memory operations, it would need to move the execute stage to after the memory stage completes, resulting in a significantly longer pipeline. That is very problematic for the purest RISC-style pipelines, which don't have branch predictors and rely on their short pipelines and branch-delay slots for branch performance.

A massively out-of-order x86 pipeline gets neither advantage. The instruction encoding is still register-memory (plus those insane read-modify-write instructions), and cracking the instructions into multiple uops causes extra pipeline complexity. Also, they have good branch predictors.

The primary motivation for cracking those instructions is actually memory latency. Those pipelines want to be able to execute the memory uop as soon as possible so that if they miss in the L1 cache, the latency can be hidden.
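A toy dataflow model shows the effect. This is a sketch under heavy assumptions: the function name, the 12-cycle miss latency and the 10-op dependency chain are made up, and issue is idealised (a uop starts the moment its inputs are ready, with unlimited execution units).

```python
def finish_times(uops):
    """uops: list of (name, latency, deps) in program order.

    Returns {name: cycle when the result is ready}, assuming each uop
    starts as soon as all of its inputs are ready.
    """
    done = {}
    for name, latency, deps in uops:
        start = max((done[d] for d in deps), default=0)
        done[name] = start + latency
    return done


MISS = 12  # assumed load latency on an L1 miss, in cycles

# Ten dependent 1-cycle ALU ops that eventually produce eax.
chain = [("c0", 1, [])] + [(f"c{i}", 1, [f"c{i - 1}"]) for i in range(1, 10)]

# add eax, [rbx] cracked: the load has no dependency on the chain.
cracked = chain + [("load", MISS, []), ("add", 1, ["load", "c9"])]

# The same instruction as one indivisible op that waits for eax first.
monolithic = chain + [("add_mem", MISS + 1, ["c9"])]

print(finish_times(cracked)["add"])         # 13
print(finish_times(monolithic)["add_mem"])  # 23
```

In the cracked version the load overlaps with the ALU chain, so most of the miss latency is hidden; the indivisible version pays the chain and the miss back to back.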

This memory latency hiding advantage is also one of the major reasons why modern high-performance RISC cores moved away from the RISC-style pipeline to adopt the same massively out-of-order pipelines as x86. They just have a large advantage in the fact that their ISA is already load/store.


mbitsnbites 873 days ago

I see what you're pointing at. I don't think that we'll fully agree on the nomenclature, but this kind of feels like the RISC vs CISC debate all over again. The reality is that the waters are muddied from the 1990's and onward.

I know that I have a tendency to over-use the word "RISC" (basically for anything that is not a 70's CISC state-machine).

> or they allude to overly broad definitions like "Everything that's not CISC"

Yup, that's me ;-)

BTW, w.r.t. nomenclature, I make a clear distinction between "architecture" and "microarchitecture" (even if I mix up contexts at times).

> But this does mean I'm arguing that anything with out-of-order execution is not RISC.

I think that this is where we disagree. E.g. the POWER1 (1990) was the first (non-mainframe) out-of-order CPU with register renaming, and it was a RISC. The Alpha 21264 (1998) was definitely both RISC and OoO. One of the first x86 implementations with uOPs translation, the NexGen Nx686 (1995, later AMD K6), was also out-of-order, and was said to have a RISC microarchitecture (based on RISC86). Not as complex as modern cores, but drawing the line at OoO does not work for me.

Historically RISC meant many things, and as you said the early RISC machines had many design quirks that did not stand the test of time (in particular exposing too many microarchitectural details in the ISA - something that proved useful in low-power VLIW DSP:s, though).

However, the key takeaway that has stood the test of time is an instruction set that enables fully pipelined execution. In the 70's and 80's, using a load-store (register-register) ISA was the only viable way out of multi-cycle instructions. To me, the principle to design instructions for pipelined execution is the main point of RISC, and the key point where it differs from CISC ISA:s, which were specifically designed for state-machine style microarchitectures (I don't have a better term).

In the 90's the same principles were implemented at the microarchitecture level (Nx586, K6, P6), without changing the architecture (i.e. the x86 ISA was kept on the surface).

Out-of-order happened to arrive to microprocessors at around the same time, so it was an obvious natural development for all high-performance CPU:s, regardless of "RISC" branding or not. It was the way forward to increase ILP (yes, you can do superscalar in-order too, but there's a rather annoying limit to parallelism there). It just so happened that cracking x86 instructions into multiple uOPs was a good way to make better use of OoO as well (in fact, that kind of cracking was exactly what was proposed by the RISC crowd in the 70's and 80's, but at the ISA level rather than at the microarchitecture level).

> I'm going to call this arch the "massively out-of-order" for now.

Mitch Alsup calls it GBOoO (Great Big Out-of-Order). There are some merits to making that distinction - but like you I would like to see a widely adopted definition.

Yet, I will continue to use terms like "RISC-like pipeline". I guess one of my main motivations is to make people understand that x86 is no longer CISC "under the hood". Especially with RISC-V being popularized, a new generation comes asking questions about RISC vs CISC (as in RISC-V vs x86), without understanding the difference between an architecture and a microarchitecture.

For most intents and purposes most GBOoO microarchitectures are comparable when it comes to the execution pipeline, regardless of which ISA they are using. The main differences are in the front end - but even there many of the principles are the same (it's mostly a question of how much effort needs to be spent on different parts - like decoding, prediction, caching).


mbitsnbites 873 days ago

Correction: Ok, the IBM z/Architecture line of CPU:s are clearly a different breed. In later generations they do use instruction cracking (I think that they were inspired by the x86 success), and insane pipelines:

https://www.semanticscholar.org/paper/History-of-IBM-Z-Mainf...

I struggle to find a categorization for the z15 - other than "massive".


phire 872 days ago

> Mitch Alsup calls it GBOoO (Great Big Out-of-Order).

I like that term. Do you have any suggested reading material from Alsup?

------------

> I see what you're pointing at. I don't think that we'll fully agree on the nomenclature,

Ok, I admit I might be going a little far by trying to redefine anything that isn't a classic in-order RISC pipeline as "not RISC" (even when they have a RISC style ISA). And as an amateur CPU architecture historian, I'm massively underqualified to be trying to redefine things.

I'm also not a fan of the fact that my argument defines any pipeline with any amount of OoO as "not RISC". Because I do know the early PowerPC pipeline quite well (especially the 750 pipeline), and the amount of out-of-order is very limited.

There is no reorder buffer. There is no scheduler; instead, dispatch only considers the next two instructions in the instruction queue, and there is only one reservation station per execution pipeline. For the 601, there are only three pipelines (Integer, FPU and Special) and branches are handled before dispatch. So while a branch or FPU instruction might be executed before an Integer instruction, you can't have two instructions for the same pipeline execute out of order.

I don't think the 601 even has renaming registers; there is no need, as integer instructions, floating-point instructions AND branch instructions all operate on different register sets (and I'm just realising exactly why PowerPC has those separate condition registers).

Now that I think about it, the 601 pipeline might be described as a superscalar in-order RISC pipeline that simply relaxes the restriction on the different execution pipes starting out of program order.

Maybe I should alter my argument so as to allow simpler out-of-order schemes to still be considered RISC. The 601 is certainly not something us people from the future would recognise as OoO, except by the strictest definition that sometimes instructions execute out of order.

The later PowerPC designs do muddy the water. The 604 (1996) introduces the concept of multiple integer pipelines that can execute the same instructions. They only have one reservation station each, but this allows instructions of the same type to be executed out of order via different pipelines. The load/store instructions were moved to their own pipeline, and in the later 750 design (aka the G3, 1997) the load/store pipeline gained two reservation stations, allowing memory instructions to be executed out of order down the same pipeline.

It's not until the PowerPC 7450 (a later G4, 2001) that the PowerPC finally gained something approaching a proper scheduler, removing the one reservation station per pipeline bottleneck.

> E.g. the POWER1 (1990) was the first (non-mainframe) out-of-order CPU with register renaming, and it was a RISC.

As I understand, the POWER1 is about the same as the PowerPC 601. There is no register renaming, the only out-of-order execution is the fact that branch instructions execute early, and floating point instructions can execute out of order with respect to integer instructions.

I don't think there is a RISC cpu with register renaming until the PowerPC 604 in 1996 or maybe PowerPC 750 in 1997, and that was very limited, only a few renaming registers.

---------------

> but this kind of feels like the RISC vs CISC debate all over again

Yes. And my viewpoint originates from my preferred answer to the RISC vs CISC debate: they are outdated terms that belong to the high-performance designs of the 80s and early 90s, and don't have any relevance to modern GBOoO designs (though RISC does continue to be relevant for low-power and low-area designs).

> I guess one of my main motivations is to make people understand that x86 is no longer CISC "under the hood"

We both agree that GBOoO designs aren't CISC. I'm just taking it a step further in saying they aren't RISC either.

But my preferred stance leads to so many questions. If such designs aren't RISC then what are they? Where should the line between RISC and not-RISC be drawn? If we are allowing more than just two categories, then how many more do we need?

It's certainly tempting to adopt your "everything is either CISC or RISC" stance just to avoid those complicated questions, but instead I try to draw lines.

And I think you agree with me that having accepted definitions for groupings of related microarchitectures would be useful, even if you want them to be sub-categories under RISC.

> BTW, w.r.t. nomenclature, I make a clear distinction between "architecture" and "microarchitecture" (even if I mix up contexts at times).

Yeah, I try to avoid "architecture" altogether, though I often slip up. I use ISA for the instruction set and microarchitecture or uarch for the hardware implementation.

----

> However, the key takeaway that has stood the test of time is an instruction set that enables fully pipelined execution...

So I agree with all this. I think what I'm trying to do (this conversation is very helpful for thinking through things) is add an additional restriction that RISC is also about trying to optimise that pipeline to be as short as possible.

Pipeline length is very much the enemy for in-order pipelines. The more stages the back-end has, the more likely you are to have data hazards. And a data hazard is really just a multi-cycle instruction in disguise. This is a major part of the reason why RISC always pairs with load/store. Also, the more stages you have in the front-end, the larger your branch misprediction delay (and in-order pipelines are often paired with weak branch predictors, if they have one at all).
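A small in-order issue model illustrates the data hazard point. It is a sketch with made-up parameters: `in_order_cycles` is a hypothetical helper, the machine is single-issue with full forwarding, and `latency` stands in for how many back-end stages separate issue from a usable result.

```python
def in_order_cycles(program, latency):
    """program: list of (dest, sources) in program order.

    Single-issue, in-order model: an instruction issues one cycle after
    the previous one at the earliest, and never before its sources are
    ready. Returns the cycle in which the last instruction issues.
    """
    ready = {}
    cycle = 0
    for dest, sources in program:
        earliest = max((ready.get(s, 0) for s in sources), default=0)
        cycle = max(cycle + 1, earliest)
        ready[dest] = cycle + latency
    return cycle


# b depends on a, c depends on b, d is independent.
program = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", [])]

print(in_order_cycles(program, latency=1))  # 4
print(in_order_cycles(program, latency=3))  # 8
```

With a 1-cycle result latency the four instructions issue back to back. With a 3-cycle latency each dependent instruction stalls for two cycles, so the dependent pair behaves like multi-cycle instructions, and the independent `d` is stuck behind them because issue is in order.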

But the switch to the GBOoO style architecture has a massive impact on this paradigm. Suddenly, pipeline length stops being so critical. You still don't want to go crazy, but now your scheduler finds different instructions to re-order into the gaps that would have been data hazard stalls in an in-order design. And part of the price you pay for GBOoO is a more complex frontend (even a RISC ISA requires extra complexity for OoO over In-order), but you are happy to pay that cost because of the benefits, and the complex branch predictors help mitigate the downsides.

(I don't know where Alsup wants to draw the line for GBOoO, but I'm talking about OoO designs with proper schedulers and ROBs with dozens of entries. Designs like the early PowerPC chips with their limited OoO don't count; they were still very much optimising for short pipeline lengths.)

I'm arguing that this large paradigm shift in design is enough justification to draw a line and limit RISC to just the classic RISC style pipelines.

> the NexGen Nx686 (1995, later AMD K6), was also out-of-order, and was said to have a RISC microarchitecture (based on RISC86).

I don't like relying on how engineers described or how the marketing team branded their CPU design for deciding if a given microarchitecture is RISC or not. RISC was more of a buzzword than anything else, and the definition was muddy.

A major point against the NexGen line being RISC (including all AMD designs from the K6 to Bulldozer etc) is that they don't crack register-memory instructions into independent uops. I pointed this out previously, but their integer pipelines can do a full read/modify/write operation with a single uop. I don't know about you, but I'm pretty attached to the idea that RISC must be load/store.

This is also part of the reason why I want more terms than just RISC and CISC. Because the NexGen line is clearly not CISC either.

And we also have to consider the other x86 designs from the 486 and 586 era. They are fully pipelined and even superscalar, but they don't crack up register-memory ops, and their pipelines haven't been optimised for length, so it would be wrong to label them as RISC or RISC-like.

But they are so far from the "state-machine style microarchitectures" (and I think that's a perfectly fine term) that CISC originated from that I think it's very disingenuous to label them as CISC or CISC-like either.

> For most intents and purposes most GBOoO microarchitectures are comparable when it comes to the execution pipeline, regardless of which ISA they are using. The main differences are in the front end - but even there many of the principles are the same

The execution pipelines themselves might be very comparable, but you are forgetting the scheduling, which adds massively to backend complexity, and has a major impact on the overall microarchitecture and the design paradigms.
