| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by philh 1687 days ago

Author in the old thread (https://news.ycombinator.com/item?id=19262249) says

> An x86-64 CPU has sixteen integer registers, but 100-200 physical integer registers. Every time an instruction writes to, say, RAX the renamer chooses an available physical register and does the write to it, recording the fact that RAX is now physical-register #137. This allows the breaking of dependency chains, thus allowing execution parallelism.

I'm curious why they have so many more physical registers than... logical? registers. I have a couple of guesses:

* Physical registers are physically cheaper to add than logical registers.

* Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

* Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

Are some of these correct? Other reasons I'm missing?

12 comments

ridiculous_fish 1687 days ago

Don't think of %eax as a real register. Think of it as a tag in a compressed dataflow graph. The compression is performed by the compiler's register allocator, and the decompression is performed by the CPU's register renaming.

A compiler's IR is a directed graph, where nodes are instructions and tagged by an assigned register. It would be pleasant to assign each node a distinct register, but then machine instructions would be unacceptably large. So the compiler's register allocator compresses the graph, by finding nodes that do not interfere and assigning them the same register.

The CPU's register renamer then reinflates this graph, by inspecting the dataflow between instructions. If two instructions share a register tag, but the second instruction has no dependence on the first, then they may be assigned different physical registers.

`xor eax, eax` has no dependence on any instruction, so it can be specially recognized as allocating a new physical register. In this way of thinking, `xor eax, eax` doesn't zero anything, but is like malloc: it produces a fresh place to read/write, that doesn't alias anything else.

brucedawson 1687 days ago

Adding more logical registers is compatibility breaking. And, since you have to encode the register specifier in the instruction it means larger instructions (hence the compatibility breaking) which makes reading and decoding instructions slower.

And, regardless of how many logical registers you have you need to have more physical registers. These are needed for out-of-order (OOO) execution and speculative execution. An OOO super-scalar speculative CPU can have hundreds of instructions in flight and these instructions all need physical registers to work on. If you don't have excess physical registers you can't do OOO or speculative execution.

CalChris 1687 days ago

Adding more logical registers doesn't have to break application compatibility. Intel added x87, MMX ... extensions with their new register sets all without breaking compatibility. They even doubled the integer register set in x86_64 with the REX prefix. New programs could use these features and their register sets without existing programs being broken.

What register renaming allows is to increase the performance of both new and existing programs, which is no mean feat. It allows the CPU scheduler to search for more out-of-order parallelism rather than relying on the compiler to find in-order parallelism.

This binary compatibility doesn't seem very important now, don't break userspace excepted, but it was then. Compatibility made IBM and Intel hundreds of billions of dollars.

monocasa 1687 days ago

Adding those registers each time at the bare minimum broke OS compatibility in order to enable them, and required kernel changes to save and restore the new architectural registers. Adding to the physical register file allows existing code (including kernel space) to take advantage of it. There's been some extensions lately like xsave that try to address that, but they're not fully embraced by major kernels AFAIK.

CalChris 1687 days ago

Tomasulo's register renaming algorithm (1967) comes from an era when you bought the computer and the operating system together. So OS compatibility was their problem. That's not our era but the resulting in-order vs out-of-order war is long over.

Register renaming enabled out-of-order execution. The 90s were a competition between in-order compiler based scheduling and out-of-order CPUs. The in-order proponents said that out-of-order was too power hungry, too complex and wouldn't scale. Well, it did. Even the Itanium which was the great in-order hope, its last microarchitecture, Poulson, had out-of-order execution.

Ultimately, out-of-order won the war but in-order survives for low power low complexity designs; the A53 is in-order. Skylake Server has 180 physical registers with an out-of-order search window of 224.

https://www.primeline-solutions.com/media/wysiwyg/news-press...

monocasa 1687 days ago

In the 1960s, os compat was your problem as the end user too. There wasn't the strict divide between OS code and user code in the same way. The 360/91 that Tomasulo's algorithm originally shipped on didn't even have an MMU to separate your code from the kernel.

Additionally, compiler tech wasn't anywhere near where it was today, and high perf code was written in ASM. Therefore existing code would have to be rewritten to use more registers, but the 360/91 ran existing code just fine (which was very important for the 360/91's main customers).

woodruffw 1687 days ago

x87 and MMX's register encodings exist in (mostly) separate parts of the x86 operand encoding map. That's in contrast to GPRs, which have to squeeze into 3 (sometimes 4, with the REX prefix) bits.

That's where the incompatibility comes from -- x86-64 required an entirely new prefix to merely double the GPRs; adding a few hundred more would require some very substantial changes to the opcode map and all decoders already out there.

CalChris 1687 days ago

When x87+MMX were added, existing programs ran unchanged. When x86-64 doubled the register sets, many existing programs ran unchanged (some features were dropped). Compatibility was largely maintained. That compatibility was what AMD wagged in Intel's face when Intel was trying to pivot to Itanium. Intel had to then take the walk of shame and adopt AMD's approach.

Seriously, Intel took a long view towards this. x87 was a wart on the side of mole and still its unholy marriage with MMX (they shared a register set) allowed existing programs to run while creating a compatibility barrier to competitors. Competitors had to be compatible and bug compatible. The guy tasked with doing this at Transmeta almost had a nervous breakdown, not from compatibility (easy) but from bug compatibility.

IBM 360 programs still run on the Z architecture.

chasil 1687 days ago

The original 8087 prompted the IEE-754 floating point standard, which had a profound impact on the entire field of computer science.

https://news.ycombinator.com/item?id=17767925

https://news.ycombinator.com/item?id=23205225

https://news.ycombinator.com/item?id=23362673

https://news.ycombinator.com/item?id=18107165

woodruffw 1687 days ago

I'm not saying it's impossible! They certainly have plenty of space in the EVEX scheme. But extending the GPRs is a much bigger lift, tooling-wise, than is adding a relatively disjoint ISA extension. Even if they can do it while preserving older encodings, it's just another speedbump at a time when Intel is probably anxious to make x86-64 as frictionless as possible.

Besides, register renaming seems to be working splendidly at the uarch level. Why complicate the architectural model when the gains are already present?

monocasa 1686 days ago

> Even if they can do it while preserving older encodings, it's just another speedbump at a time when Intel is probably anxious to make x86-64 as frictionless as possible.

Just wanted to throw out there that it was AMD that came up with x86-64's ISA rather than Intel. Intel was still pushing Itanium hard at the time.

garaetjjte 1687 days ago

x87 instructions weren't dropped, they are still available on x86_64.

CalChris 1687 days ago

Yeah, you're right. 64-Bit Mode Valid Thanks.

CalChris 1687 days ago

> I'm curious why they have so many more physical registers than... logical? registers

Register renaming allows instructions to be executed out-of-order [1] which allows for more instruction throughput.

This goes back to 1967 and to the IBM 360/91 with its 4 floating point registers. That's not many registers but Moore's law was making more transistors available. The problem was how to use these transistors to get more throughput from existing programs without changing the ISA and (potentially) breaking compatibility.

The solution was Tomasulo's algorithm [2] which allowed (few) architectural registers to be renamed to (many) physical registers.

  original         renamed           reordered
  mov RAX, 1       mov PHYS1, 1      mov PHYS1, 1; mov RAX, [RCX]
  add RBX, RAX     add RBX, PHYS1    add RBX, PHYS1
  mov RAX, [RCX]   mov RAX, [RCX]

The first and third instructions can be executed at the same time on independent functional units. The third is out-of-order with respect to the second.

[1] https://inst.eecs.berkeley.edu/~cs152/sp20/lectures/L10-Comp...

[2] https://en.wikipedia.org/wiki/Tomasulo_algorithm

cogman10 1687 days ago

I've not seen this in other responses, but an answer to your question is every logical register adds overhead to context switching by the operating system.

The OS has to store and load all registers whenever it decides to switch which thread is processing. 100 more logical registers means 100 more locations the OS has to keep track of.

This is part of the reason why new SIMD instruction sets need OS support before you can start using them.

kps 1687 days ago

> Adding logical registers breaks backwards compatibility

This, plus adding logical registers increases instruction size and therefore decreases the number of instructions that can be fetched with a given memory bandwidth.

enragedcacti 1687 days ago

It can definitely be beneficial to add logical registers. When AMD designed x86-64 they doubled the number of general logical registers up to 16. As other commenters have said, unless you are already making breaking changes, increasing the number of logical registers is probably not worth it.

msla 1687 days ago

Having more physical registers than logical means the CPU can do optimizations, opcodes don't have to be as big (it takes more bits to encode more registers), compatibility with older binary code is maintained, and CPUs at different price points can have different numbers of physical registers while all being able to run the same binaries.

(I don't know if any manufacturer actually does that last thing, however.)

guerrilla 1687 days ago

> * Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

> * Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

These two points are the same thing: compatibility and compatibility is what Intel and AMD have lived on from day one with x86. It's why we still live with this really weird instruction set with all of its historical oddities. Certain features of real mode weren't removed until long into the 64-bit era. Adding things isn't any better: If you wanted to add more add more registers, you'd have to change instruction encoding and the instruction space is finite (actually limited to 15 bytes.) That would be rather disruptive.

tenebrisalietum 1687 days ago

I think your second point hits it and is the primary benefit to hiding the microarchitecture layer - it can be improved and existing code will benefit from it.

Basically Intel is saying if you had 200 GPRs, you couldn't do better at using the free ones than the CPU scheduler/decoder.

> Adding *architecturally visible* registers increases complexity for people writing assembly and/or compilers.

More registers just makes your code less likely to have to shuffle stuff to and back from RAM - which is where stuff will go if you don't have registers.

It's always faster for a CPU to access registers within itself than have to talk over a bus to a memory. Even when RAM was the same speed as CPUs (8-bit era) you would still save a cycle or two.

brucedawson 1687 days ago

Having more logical/architectural registers is great except for a few costs:

1) More bits to encode register numbers in instructions. Doubling the number of logical registers costs another two or three bits depending on how many registers are referenced in an instruction

2) Logical registers have to be saved on context switches

3) Logical registers have to be saved around function calls. Either the caller or the callee has to save (or not use) registers, and most functions are both callers and callees. That is, if you are not a leaf-node function then every register you use you have to first save to the stack, or else assume that the functions you call will trash it. Thus, more registers have diminishing returns.

4) No matter how many logical registers you have you _always_ want to have an order of magnitude more physical registers, because otherwise you can't implement OOO or speculative execution.

Point #4 is probably the most critical because I think what people are really asking is why are there more physical than logical registers, and OOO/speculative execution is the answer.

captainmuon 1687 days ago

I wonder how things like register renaming (or pipelining) are implemented. It would seem difficult even in a high level language, but they do it inside the processor. Is this in microcode that runs on the "actual" processor? Or is it in hardware? Do they write th algorithm in a language like VHDL or Verilog?

brucedawson 1687 days ago

Register renaming is implemented in hardware. Because it is used on every instruction it is on the critical path and is probably hand-optimized. Here is some more reading on this topic:

https://en.wikipedia.org/wiki/Register_renaming

monocasa 1687 days ago

Renaming isn't really done in microcode. Microcode is just another source for ops that get renamed. All of the renaming happens in hardware, and boils down to a handful of tables inside the processor.

comex 1686 days ago

Some cores are open source and you can see for yourself.

Rename logic from BOOM, a RISC-V core written in a DSL embedded in Scala:

https://github.com/riscv-boom/riscv-boom/blob/1ef2bc6f6c98e5...

From RSD, a core designed for FPGAs written in SystemVerilog:

https://github.com/rsd-devel/rsd/blob/master/Processor/Src/R...

And then there's Alibaba's recently open-sourced XuanTie C910, which contains this Verilog… which is completely unreadable. Seems like it was produced by some kind of code generator that they didn't open-source?

https://github.com/T-head-Semi/openc910/blob/d4a3b947ec9bb8f...

dnautics 1687 days ago

All of the above, I believe.

dexen 1687 days ago

Beyond reasons & limitations explained in sibling posts, having large number of (logical / instruction-level) registers also inflates instruction size, and thus diminishes instruction density, and thus lowers performance - so there is a trade-off between that and large number of registers. Hear me out.

The CPU has limited memory bandwidth; the larger instruction size, the more bytes needs to be loaded from memory to execute the instruction. Same with cache size - the more space an instruction takes, the lower the amount of instructions that is cached. Lastly, there's the complexity & latency of the instruction decoder. This possible performance loss is averted by keeping instructions short and instruction set "dense".

Any instruction that refers to a register needs certain amount of bits in the operand portion to indicate which specific register(s) is to be used [1][2][3]. As example, in case of 8-register x86 the operand generally uses 3 bits just to indicate which register to use. In case of 16 register x86_64, it takes 4 bits. If we wanted to use all 200 physical register, that would require whole 8 bits reserved in the instruction just to indicate the register to use. Certain instructions - data transfer, algebra & bitwise operations, comparisons, etc. - naturally use two or more registers, so multiply that accordingly.

Since using this many registers gives only diminishing return in terms of performance (and also requires very heavy lifting on compiler's part[4]), the trade-off selected is that the compiler uses architecture-defined small number of registers, and the processor at runtime is able to speed up some code using the spare registers for instruction-level execution parallelism.

[Edit]

There's one more common circumstance where large number of registers is undesirable: a change of execution context (thread switch; process switch; interrupt). Typically all architecturally-visible registers are saved to memory on a change of context and new set is loaded for the new context. The more registers there are, the more work is to be done. Since the hardware registers are managed directly by CPU and serve as more of cache than directly accessed register, they don't need to be stored to memory.

[1] Aside of certain specialized instructions that implicitly use a particular register; for example in x86 many instructions implicitly use the FLAGS register; DIV/IDIV integer division implicitly uses AX and DX registers.

[2] Aside of certain instruction prefixes that influence which register is used; for example in x86 that would be segment register overrides.

[3] Aside of certain architectures where registers were organized in a "file" and available only through a "window" - i.e., an implicit context, implicit register addressing base; instruction operands referred to registers relative to the current window, and the whole window could be shifted by specialized instructions. Typically shifted on function enter/leave and similar. This was more-or-less the whole "hardware registers" being exposed at architecture level, however in a somewhat constrained / instruction-dense way.

[4] Arranging which registers to use, which to spill to memory etc. is non-trivial work for compiler, and the complexity grows super-linearly with the number of registers.

foobiekr 1686 days ago

the ISA defines logical registers

the implementation defines physical registers

any implementation is permissible as long as it conforms to the ISA