| Thanks, I have read the Agner documents before. I will dig around some more and get updated. Anyway, I found this, regarding RMW (for Ice/Tiger Lake): > Most instructions with a memory operand are split into multiple μops at the allocation stage. Read-modify instructions, such as add eax,[rbx], are split into two μops, one for address calculation and memory read, and one for the addition. Read-modify-write instructions, such as add [rbx],eax, are split into four μops. I read it as a instructions that use memory operands (other than simple mov instructions) are usually split into at least two uOPs, which makes perfect sense for a load/store pipeline. > So it's far from a proper load/store pipeline. And when you think about it, that makes sense, x86 isn't a load/store ISA so it would be wasteful to not have special accommodations for it. The way I see it, modern x86 microarchitectures are load/store. My definition of load/store is that all instructions/operations that flow through the execution part of the pipeline can either load/store data OR perform operations on registers, not both (except possibly edge cases like calculating an address or writing back an updated address to a register). That is by far the most efficient way to implement a CPU pipeline: You don't want to read data in one pipeline stage, use the data in an ALU in a later stage, and possibly write data in an even later stage. That would drastically increase instruction latency and/or require duplication of resources. This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs is the first place: translate x86 instructions into uOPs (multiple ones for instructions that access memory) so that the pipeline can be implemented as a traditional load/store pipeline. In a way the x86 front end is similar to software binary translation (a'la Transmeta, NVIDIA Denver or Apple Rosetta 2). It's fairly complex, and the prime objective is to take code for a legacy ISA and transform it into something that can run in a pipeline that the ISA was originally not intended to run in. By doing the translation in hardware you avoid the latencies inherent to software translation (JIT or AOT), but the costs are unavoidable (particularly silicon area and power consumption). |
But that seems to be an incorrect perspective. The pipeline's native internal ISA appears to very much be the "fused-uop". It's the only metric which matters for most of the pipeline, decode limits are specified in fused-uops, the uop cache stores fused uops, the ROB only uses a single entry for fused-uop. The only part of the pipeline were that deals with unfused-uops is the scheduler and the execution units themselves. Even the retire stage works on fused-uops.
It's probably better to think of the pipeline's native ISA as an instruction that can sometimes be scheduled to two execution units. It's almost a very basic VLIW arch, if you ignore the dynamic scheduling.
Sure, the execution units are load/store. And the scheduling is load/store. But I don't think that's enough to label the entire pipeline as load/store since absolutely every other part of the pipeline uses fused-uops and is therefore not load/store.
> This is, AFAIK, one of the main advantages and probably the raison d'être for uOPs is the first place: translate x86 instructions into uOPs so that the pipeline can be implemented as a traditional load/store pipeline.
I'm really not a fan of the "translate" terminology being used to describe modern x86 pipelines. It's not quite wrong, but it does seem to mislead people (especially RISC fans) into overstating the nature of the transformation.
It's nothing like software binary translation (especially something like Rosetta 2), the transforms are far simpler. It's not like Intel took an off-the-shelf RISC architecture and limited their efforts to just designing a translation frontend for it (the few examples of direct hardware translation, like NVIDIA Denver and Itanium have pretty horrible performance in that mode).
No, they designed and evolved the pipeline and its internal ISA to directly match the x86 ISA they needed to run.
All the front end is really doing is regularising the encoding to something sane and splitting up some of the more complex legacy instructions. Instructions with memory operands are converted to a single fused-uop. The front-end only splits the read-modify-write instructions into two fused-uops. The transform into proper load/store form doesn't happen until much further down the pipeline as the fused-uop gets inserted into the scheduler.
I have quite a bit of experience writing software binary translation software, and I ensure you such translations are significantly more complex than the transforms you find inside an x86 pipeline.
> Thanks, I have read the Agner documents before. I will dig around some more and get updated.
I swear every single time I read them (or usually just parts of them) I learn more about x86 microarches (and CPU design in general). It's not something that can be absorbed in a single pass.