
by joosters 4676 days ago

Has anyone done a study on the optimal number of registers to have?

The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.

In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:

* Increased instruction set size (having to encode a larger register space in the bit patterns of each instruction)

* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all those 1000 registers somewhere. There could be some HW-assist but you'll pay the price somewhere.

* Extra cost for saving registers in functions. Sure, depends upon the ABI as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.

* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.

* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But aside from that optimal case, you'll still end up having to write registers back to memory. It makes no difference having 100 registers if you store the results of every computation...

Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, then make gcc compile some SPEC benchmarks using the ISA and model it to see how efficient having an extra register makes it. You could graph registers vs simulated runtime and see where the sweet spot is.
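
A toy version of the experiment proposed above can be sketched in Python. This is not gcc or SPEC: it assumes a synthetic access trace and an idealised allocator that, when the registers are full, evicts the value whose next use is furthest away. The names `count_loads` and `synthetic_trace`, and the shape of the workload, are made up for illustration.

```python
import random


def count_loads(trace, num_regs):
    """Count loads from memory when `trace` (a sequence of variable names)
    is served by `num_regs` (>= 1) registers. On a miss with all registers
    full, the value whose next use is furthest in the future is evicted."""
    # next_use[i] = position of the next access to trace[i] after i (inf if none).
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    in_regs = {}  # variable -> position of its next use
    loads = 0
    for i, var in enumerate(trace):
        if var not in in_regs:
            loads += 1  # value is not in a register, fetch it from memory
            if len(in_regs) >= num_regs:
                victim = max(in_regs, key=in_regs.get)
                del in_regs[victim]
        in_regs[var] = next_use[i]
    return loads


def synthetic_trace(num_vars=40, length=5000, seed=0):
    """Hypothetical workload: a few hot variables and many cold ones."""
    rng = random.Random(seed)
    names = [f"v{i}" for i in range(num_vars)]
    weights = [1.0 / (i + 1) for i in range(num_vars)]
    return rng.choices(names, weights=weights, k=length)


if __name__ == "__main__":
    trace = synthetic_trace()
    for n in (2, 4, 8, 16, 32, 64):
        print(n, count_loads(trace, n))
```

Plotting loads against n for a real trace would give the kind of curve asked about. Where it flattens depends entirely on the workload, so no numbers are claimed here.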

8 comments

Scaevolus 4676 days ago

Yes, it's been studied. You rapidly run into diminishing returns.

http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf

Here's a good thread discussing this: https://groups.google.com/forum/#!searchin/comp.arch/number$...

joosters 4676 days ago

Awesome! Thank you for the link.

brigade 4676 days ago

The studies would vary over time because CPU design and bottlenecks have changed. Early designs were of course limited by transistor count; now we have OoOe, and physical registers are limited by muxes and latency (see the presentations by the Mill CPU guy [1]).

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Saving on context switches (interrupts alone aren't a big deal) was indeed a problem back when AltiVec was designed, thus it has a special register to keep track of which registers need to be saved. In modern designs this is less of a problem, between higher frequencies, multiple cores, and the other effects of a context switch dominating (effective flush of the L1 cache and predictors).

The interesting bits nowadays are that load/store is expensive power-wise, which was what ARM identified as the major motivation behind having 32 registers (fewer spills in functions) and OoOe designs.

[1] http://m.youtube.com/watch?v=QGw-cy0ylCc&desktop_uri=%2Fwatc...

joosters 4676 days ago

> Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?

brigade 4676 days ago

No one uses more registers just to use more registers - in OoOe designs the main reason to use more registers is to reduce spilling and reloading. So in effect a compiler isn't going to use a register it has to save, unless in doing so it saves a spill+reload, which would result in the same number of load/store as without the additional register.

In-order designs have more reasons to use more registers, but again they aren't going to use more registers unless they gain something.

masklinn 4676 days ago

> The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers?

TFA gives at least one reason:

> Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.

Furthermore, although an ISA exposes a number of registers A, OOO architectures (and their friends, parallel and speculative execution) pretty much require the CPU to have more than A physical registers and do register renaming, which lowers the number of registers the ISA can define. For instance the Alpha ISA defines 32 integer registers, but the Alpha 21264 had 80 physical integer registers.

joosters 4676 days ago

That's definitely another factor. Again though, I doubt it's the limiting one. No-one (as far as I know) has produced a power-hungry CPU with (say) 5000 registers on it.

terrymah 4676 days ago

I've heard that modern Intel processors have 100 < x < 200 physical registers. I'm not sure they actually document the exact number.

hornetblack 4675 days ago

Itanium has at least 256 (128 integer + 128 floating-point, plus 64 one-bit predicate registers, which are essentially flip-flops).

solarexplorer 4676 days ago

Register windows are a way to put 1000 registers in a CPU. See the SPARC and Itanium instruction sets for how this can be done. There are also plenty of studies about both.

Vector registers are another way to use 1000 registers.

But directly coding 1000 registers into each instruction does not seem to be such a good idea. You might as well use a 1st level cache. The difference between the cache and the register file is mostly how the instruction set architecture references it. Registers are usually easier to access because each one has a single name and the CPU can detect dependencies and conflicts easily. Memory accesses and caches are more complex because you need to calculate the addresses before you can detect dependencies/conflicts.

PS: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.

terrymah 4676 days ago

It's complicated, but modern processors actually do have many more registers than you can name in the instructions. They use things like "register renaming" to avoid false conflicts between instructions.

Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.

edderly 4676 days ago

I thought this was an interesting insight in to that: http://ootbcomp.com/docs/belt/index.html

vilda 4676 days ago

Note that the actual number of registers is considerably different from the number of registers you can access through the instruction set. They are used via register renaming and optimizations of complex instructions.

joosters 4676 days ago

Yes. As other commenters have said, if you are doing out-of-order execution well, the CPU will have many more 'hidden' registers and do register renaming to use them. But this has an interesting interaction with compilers.

Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r1, [r0, #4]
  ADD r1, r1, #1
  STR r1, [r0, #4]
  LDR r1, [r0, #8]
  ADD r1, r1, #1
  STR r1, [r0, #8]

Now, if your CPU can do OoOE, it can spot that register r1 is used for three independent loads, adds and stores, and can internally use three different registers for them, allowing the operations to be done in parallel. But, equally, the compiler could have written the code as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r2, [r0, #4]
  ADD r2, r2, #1
  STR r2, [r0, #4]
  LDR r3, [r0, #8]
  ADD r3, r3, #1
  STR r3, [r0, #8]

Compilers and register renaming are fighting each other. In traditional compiler writing, you try to minimise the register usage and output the first code listing. But if you have plenty of registers, you could output the second code instead, and let the CPU do parallel execution without the need for register renaming.

In other words, once you have enough 'real' registers, does that get rid of the need for register renaming? Intel added it to their Pentiums to improve existing x86 code, but I wonder if it has that much of a benefit with newer ISAs that have 'enough' registers and properly tuned compilers?
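
To make the renaming step concrete, here is a minimal Python sketch of it applied to the two listings above. It is a simplification under stated assumptions: instructions are reduced to `(op, dest, srcs)` tuples, memory offsets are ignored, and the pool of physical registers (`p0`, `p1`, ...) is unlimited and never recycled. The helper names `rename` and `listing` are hypothetical.

```python
def rename(instructions, arch_regs):
    """Rename architectural registers to physical ones.

    Each instruction is (op, dest, srcs). Every write gets a fresh physical
    register; reads use the most recent mapping. Simplification: physical
    registers are unlimited and never freed."""
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for op, dest, srcs in instructions:
        new_srcs = tuple(mapping[s] for s in srcs)  # look up sources first
        new_dest = None
        if dest is not None:
            new_dest = f"p{next_phys}"
            next_phys += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


def listing(temps):
    """Build the load/add/store sequence using the given temporaries."""
    out = []
    for tmp in temps:
        out.append(("LDR", tmp, ("r0",)))
        out.append(("ADD", tmp, (tmp,)))
        out.append(("STR", None, (tmp, "r0")))
    return out


ARCH = ["r0", "r1", "r2", "r3"]
first = rename(listing(["r1", "r1", "r1"]), ARCH)   # r1 reused three times
second = rename(listing(["r1", "r2", "r3"]), ARCH)  # three distinct temporaries
print(first == second)
```

Under these assumptions both listings rename to the same stream of physical registers, so for this particular snippet the compiler's choice of temporaries makes no difference once renaming is in place.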

brigade 4676 days ago

You still need OoOe to execute your second example optimally since you didn't schedule the instructions, which points to why OoOe isn't going away - there are going to be code sequences that the compiler cannot schedule optimally, particularly around branches. Additionally, cache misses are impossible to predict statically, and OoOe helps hide those.

And no one does OoOe without register renaming.
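
The scheduling point can be illustrated with a rough timing model in Python. Everything here is an assumption for illustration: a single-issue, in-order pipeline, a 3-cycle load and 1-cycle add/store latency (not taken from any real core), and no modelling of memory dependencies or cache misses.

```python
def in_order_cycles(instructions, latency):
    """Cycles to issue `instructions` on a single-issue, in-order pipeline.

    An instruction issues once the previous one has issued and all of its
    source registers are ready. Each instruction is (op, dest, srcs)."""
    ready = {}  # register -> cycle at which its value becomes available
    cycle = 0   # earliest cycle at which the next instruction may issue
    for op, dest, srcs in instructions:
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        if dest is not None:
            ready[dest] = issue + latency[op]
        cycle = issue + 1
    return cycle


LATENCY = {"LDR": 3, "ADD": 1, "STR": 1}  # assumed latencies

temps = ["r1", "r2", "r3"]
loads = [("LDR", t, ("r0",)) for t in temps]
adds = [("ADD", t, (t,)) for t in temps]
stores = [("STR", None, (t, "r0")) for t in temps]

# Second listing exactly as written: load, add, store for each temporary.
as_written = [ins for group in zip(loads, adds, stores) for ins in group]
# The same instructions with the loads hoisted ahead of the adds and stores.
scheduled = loads + adds + stores

print(in_order_cycles(as_written, LATENCY), in_order_cycles(scheduled, LATENCY))
```

In this model the hoisted order finishes in fewer cycles because each add no longer waits on the load directly in front of it. A compiler can do this reordering statically for straight-line code; the cases it cannot see, such as cache misses, are outside what this sketch models.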

joosters 4676 days ago

Yeah, I avoided any other changes to avoid confusing the issue. But any reordering I could have done, the compiler could have done too. Your point about branches is fair though, as the 'active' renamed registers after a branch can only be known at runtime.

Still, I wonder whether some of the features of modern CPUs could be dropped if it wasn't for legacy code. On the other hand, Itanium tried to push the parallelism work onto the compiler and look where that ended up!

Symmetry 4676 days ago

Most high performance CPUs will have ~100 physical registers or so, possibly divided up in multiple segments.

But abstracting those you have your architectural registers that are presented by your ISA, and the CPU uses register renaming to map those onto the physical registers.

The tradeoffs involving ISA registers are more intense. You have to load and store all of them on thread swaps, but that's pretty trivial. More importantly the bits you have to use to specify which register you're using are bits that you're paying in every single instruction you have, increasing the size of your executable and the pressure on your caches.

Different sorts of architectures have their sweet spots at different places. In-order processors doing lots of matrix math and such benefit from lots of architectural registers; the Itanium had 128 integer and 128 floating point registers, and that was the right amount for a VLIW architecture with its features. Modern GPUs are similar.

On the other hand, your typical OoO CPU will have either 16 or 32 registers you can address at a time, and that seems to be close to optimal. It's hard to say since instructions come in discrete chunks and your number of registers has to be a power of 2 as a practical matter.
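
The encoding cost mentioned above is simple arithmetic, sketched here in Python. It assumes a three-operand instruction format (two sources, one destination) with each register named by a plain binary field; `register_field_bits` is a hypothetical helper, not part of any toolchain.

```python
def register_field_bits(num_regs, operands=3):
    """Bits needed to name `operands` registers out of `num_regs`."""
    bits_per_operand = max(1, (num_regs - 1).bit_length())
    return operands * bits_per_operand


for n in (16, 32, 128, 1024):
    print(n, register_field_bits(n))
```

Under that assumption, 16 registers cost 12 bits per instruction, 32 cost 15, and 128 cost 21. With 1024 registers the three fields alone take 30 bits, which would leave only 2 bits for everything else in a fixed 32-bit instruction.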

codex 4676 days ago

Fundamentally, having more registers increases the speed-of-light delays in accessing the register. If it did not, we would just operate on main memory itself. However, with too few registers you lose the ability to perform complex computations efficiently. So I believe it is, indeed, a compromise between speed and a need to maintain scratch state. I would be surprised if Intel and AMD didn't constantly run simulations of common computations in an effort to find the optimal size of all on-chip structures.

joosters 4676 days ago

That's definitely another factor but I suspect it isn't the limiting factor. Sure, design a chip with a million registers and you'll end up constructing them like RAM. But with orders-of-magnitude fewer registers, 16 or 32 or whatever, the size of the register banks on the CPU can't be that significant to incur speed-of-light style delays, surely?

sliverstorm 4676 days ago

With 16x fewer registers, that equates to about 1 chip's worth of registers (remember a stick of DRAM often has 8-16 individual chips on it). While this is already clearly a huge problem, consider additionally that DRAM is made with trench capacitors, unlike SRAM. DRAM is dramatically slower and more dense than SRAM. So we either sacrifice speed, or bloat our one-chip's-worth of area by a few factors, say x4-8.

Then there's practicalities like sense amp design. Large register arrays are not read in a digital fashion, and current L2 and L3 sizes already press their sense amps to their limits. DRAM also uses sense amps, but the amps are again slower and larger.

http://en.wikipedia.org/wiki/Sense_amplifier

masklinn 4676 days ago

Probably not, but there are definitely delay effects at play or L2 and L3 cache would be unnecessary, you could just have humongous L1s.