> I guess it's a debate of nomenclature, which lacks real importance (although it helps to reduce confusion).

Agreed. I wouldn't be pedantic about this for a higher level discussion, but when you start to get into the weeds of what a modern x86 pipeline is, it starts to matter.

> (i.e. the topology and flow is mostly the same as a RISC-style load-store pipeline).

So that brings up an interesting question. What is a RISC-style pipeline?

Everyone mostly agrees what a RISC-style ISA is, usually focusing on the load/store arch and fixed-length instructions (exposed pipeline implementation details were another common attribute of early RISC ISAs, but everyone now agrees that was a bad idea).

But nobody ever seems to define "RISC-style pipeline", or they allude to overly broad definitions like "Everything that's not CISC" or "any pipeline for a RISC-style ISA". But you might have noticed that I like more precision in my nomenclature. And modern x86 CPUs prove the point that just because something is executing a CISC-like ISA doesn't mean the pipeline is also CISC. There is nothing stopping someone implementing a CISC-style uarch for a RISC-style ISA (other than the fact it's a stupid idea).

I'd like to make the bold argument that the definition of "RISC-style pipeline" should be limited to simple in-order pipelines that are optimised for high throughput approaching one instruction per cycle (or one group of instructions per cycle for superscalar designs).

The 5-stage MIPS pipeline is probably the most classic example of a RISC pipeline and the one usually taught in CPU architecture courses. But it ranges from the 3-stage pipelines of Berkeley RISC and early ARM chips to... well, I think we have to include the PowerPC core from the PlayStation 3 and Xbox 360 in this definition, and that has something like 50 stages.

(BTW, I also like to exclude VLIW arches from this definition of a RISC-style pipeline; sorry to everyone who likes Itanium and Transmeta.)

Your 8-stage MRISC32 pipeline is a great example of a modern RISC-style pipeline, along with all the in-order cores from ARM, which are often around the same length.

But this does mean I'm arguing that anything with out-of-order execution is not RISC. Maybe you could argue that some simpler out-of-order schemes (like the early PowerPC cores) aren't that far from RISC pipelines because they only have very small OoO windows. But by the time you get to the modern high-performance paradigm of many instruction decoders (or uop caches), absolutely massive re-order buffers, unified physical register files and many specialised execution units, it's very much a different architecture to the original RISC-style pipelines, even if they still implement a RISC-style ISA.

It's an architecture that we don't have a name for, but I think we should. Intel pioneered this architecture starting with the P6 pipeline and perfected it around the time of Sandy Bridge, and now everyone else seems to be copying it for their high-performance cores. I'm going to call this arch the "massively out-of-order" for now.

------------

Anyway, I wanted to switch to this more precise and pedantic definition of "RISC-style pipeline" so that I could point out a key difference between Intel's motivation for adopting this load-store aspect in their pipeline and the motivation behind RISC-style pipelines.

RISC-style pipelines are load-store partly because it results in a much more compact instruction encoding and allows fixed-length instructions, but mostly because it allows for a much simpler and more compact pipeline. The memory stage of the pipeline is typically after the execute stage. If a RISC-style pipeline needed to implement register-memory operations, it would need to move the execute stage after the memory stage, resulting in a significantly longer pipeline. That would be very problematic for the purest RISC-style pipelines, which don't have branch predictors and rely on their short pipelines and branch-delay slots for branch performance.

A massively out-of-order x86 pipeline gets neither advantage. The instruction encoding is still register-memory (plus those insane read-modify-write instructions), and cracking the instructions into multiple uops causes extra pipeline complexity. Also, they have good branch predictors.

The primary motivation for cracking those instructions is actually memory latency. Those pipelines want to be able to execute the memory uop as soon as possible so that if it misses in the L1 cache, the latency can be hidden.

This memory latency hiding advantage is also one of the major reasons why modern high-performance RISC cores moved away from the RISC-style pipeline to adopt the same massively out-of-order pipelines as x86. They just have a large advantage in the fact that their ISA is already load/store.
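To make the "cracking" idea concrete, here is a minimal Python sketch of how a register-memory instruction could be split into load-store style uops. Everything in it is an illustrative assumption: the `Uop` record, the `crack` function, the `tmp0` temporary and the `"[reg]"` operand syntax are all made up for this example, and real uop formats are vendor-specific and look nothing like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    kind: str     # "load", "store", or an ALU operation such as "add"
    dst: str      # destination register (or the memory operand for a store)
    srcs: tuple   # source registers or memory operands


def is_mem(operand):
    # Toy convention: memory operands are written as "[reg]".
    return operand.startswith("[")


def crack(op, dst, src):
    """Split one toy two-operand instruction `op dst, src` into uops.

    "tmp0" is a hypothetical internal temporary, not an architectural register.
    """
    if is_mem(dst) and is_mem(src):
        raise ValueError("toy model: at most one memory operand")
    if is_mem(src):
        # Register-memory form: the load becomes its own uop, so a scheduler
        # can start it before the ALU uop's other operand is ready.
        return [Uop("load", "tmp0", (src,)), Uop(op, dst, (dst, "tmp0"))]
    if is_mem(dst):
        # Read-modify-write form: load, operate, then store back.
        return [
            Uop("load", "tmp0", (dst,)),
            Uop(op, "tmp0", ("tmp0", src)),
            Uop("store", dst, ("tmp0",)),
        ]
    # Register-register form is already load-store style: one uop.
    return [Uop(op, dst, (dst, src))]


for insn in [("add", "rax", "rcx"), ("add", "rax", "[rbx]"), ("add", "[rbx]", "rcx")]:
    print(insn, "->", [u.kind for u in crack(*insn)])
```

The point of the sketch is only that the load ends up as a separate, independently schedulable unit of work; it says nothing about how a real decoder does this.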
I know that I have a tendency to over-use the word "RISC" (basically for anything that is not a 70's CISC state-machine).
> or they allude to overly broad definitions like "Everything that's not CISC"
Yup, that's me ;-)
BTW, w.r.t. nomenclature, I make a clear distinction between "architecture" and "microarchitecture" (even if I mix up contexts at times).
> But this does mean I'm arguing that anything with out-of-order execution is not RISC.
I think that this is where we disagree. E.g. the POWER1 (1990) was the first (non-mainframe) out-of-order CPU with register renaming, and it was a RISC. The Alpha 21264 (1998) was definitely both RISC and OoO. One of the first x86 implementations with uOP translation, the NexGen Nx686 (1995, later AMD K6), was also out-of-order, and was said to have a RISC microarchitecture (based on RISC86). These were not as complex as modern cores, but drawing the line at OoO does not work for me.
Historically RISC meant many things, and as you said the early RISC machines had many design quirks that did not stand the test of time (in particular exposing too many microarchitectural details in the ISA - something that proved useful in low-power VLIW DSPs, though).
However, the key takeaway that has stood the test of time is an instruction set that enables fully pipelined execution. In the 70's and 80's, using a load-store (register-register) ISA was the only viable way out of multi-cycle instructions. To me, the principle of designing instructions for pipelined execution is the main point of RISC, and the key point where it differs from CISC ISAs, which were specifically designed for state-machine style microarchitectures (I don't have a better term).
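As a rough back-of-the-envelope illustration of why full pipelining matters, the sketch below compares idealised cycle counts for a multi-cycle state machine and a pipeline. The function names and the numbers are made up for the example, and the model deliberately ignores hazards, stalls, branches and cache misses, so treat it as an upper bound on the benefit rather than a measurement.

```python
def multicycle_cycles(n_instructions, cycles_per_instruction):
    # State-machine style: each instruction runs to completion before
    # the next one starts.
    return n_instructions * cycles_per_instruction


def pipelined_cycles(n_instructions, stages):
    # Ideal pipeline: the first instruction takes `stages` cycles to drain,
    # then one more instruction completes every cycle.
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1


n = 1000
print(multicycle_cycles(n, 5))     # 5000
print(pipelined_cycles(n, 5))      # 1004
print(pipelined_cycles(n, 5) / n)  # 1.004 cycles per instruction
```

Under these assumptions the pipelined design approaches one instruction per cycle as the instruction count grows, which is the property the ISA is being designed around.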
In the 90's the same principles were implemented at the microarchitecture level (Nx586, K6, P6), without changing the architecture (i.e. the x86 ISA was kept on the surface).
Out-of-order happened to arrive in microprocessors at around the same time, so it was an obvious natural development for all high-performance CPUs, regardless of "RISC" branding or not. It was the way forward to increase ILP (yes, you can do superscalar in-order too, but there's a rather annoying limit to parallelism there). It just so happened that cracking x86 instructions into multiple uOPs was a good way to make better use of OoO as well (in fact, that kind of cracking was exactly what was proposed by the RISC crowd in the 70's and 80's, but at the ISA level rather than at the microarchitecture level).
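A toy scheduler makes the interaction between cracked uOPs and OoO easier to see. The Python sketch below is a deliberately crude model and every detail is an assumption made for illustration: one uOP issues per cycle, destination names are unique (as if registers were already renamed), a `window` of 1 stands in for in-order issue, and the 10-cycle load latency is an arbitrary stand-in for a cache miss.

```python
def run(uops, window):
    """Return the cycle at which the last result becomes available.

    uops:   list of (dst, srcs, latency) in program order, unique dst names.
    window: how many of the oldest unissued uops the scheduler may pick from;
            1 models in-order issue, larger values model an OoO window.
    """
    produced = {dst for dst, _, _ in uops}
    ready_at = {}          # dst -> cycle at which its value is available
    pending = list(uops)
    cycle = 0
    finish = 0

    def is_ready(src):
        # Sources not produced by any uop are treated as always available.
        return src not in produced or ready_at.get(src, float("inf")) <= cycle

    while pending:
        # Issue the oldest uop in the window whose sources are all ready.
        for i, (dst, srcs, latency) in enumerate(pending[:window]):
            if all(is_ready(s) for s in srcs):
                ready_at[dst] = cycle + latency
                finish = max(finish, cycle + latency)
                del pending[i]
                break
        cycle += 1
    return finish


# Two cracked "add reg, [mem]" instructions; both loads miss (10 cycles).
program = [
    ("t0", ("rbx",), 10),       # load t0 <- [rbx]
    ("r1", ("rax", "t0"), 1),   # add  r1 <- rax, t0
    ("t1", ("rcx",), 10),       # load t1 <- [rcx]
    ("r2", ("rdx", "t1"), 1),   # add  r2 <- rdx, t1
]
print(run(program, window=1))   # 22: the second load waits behind the first add
print(run(program, window=4))   # 12: the two misses overlap
```

Because the loads are separate uOPs, the wider window can start the second load while the first add is still waiting, so the two miss latencies overlap instead of adding up.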
> I'm going to call this arch the "massively out-of-order" for now.
Mitch Alsup calls it GBOoO (Great Big Out-of-Order). There are some merits to making that distinction - but like you I would like to see a widely adopted definition.
Yet, I will continue to use terms like "RISC-like pipeline". I guess one of my main motivations is to make people understand that x86 is no longer CISC "under the hood". Especially with RISC-V being popularized, a new generation comes asking questions about RISC vs CISC (as in RISC-V vs x86), without understanding the difference between an architecture and a microarchitecture.
For most intents and purposes most GBOoO microarchitectures are comparable when it comes to the execution pipeline, regardless of which ISA they are using. The main differences are in the front end - but even there many of the principles are the same (it's mostly a question of how much effort needs to be spent on different parts - like decoding, prediction, caching).