I thought modern speculative cpu’s had way more registers than you can normally access. Why does it reserve these registers for speculative execution instead of exposing them to the program if it needs them?
That's not really how registers or speculative execution works. Intuitively, you can think of assembly as trying to describe a graph of instruction dependencies. Having 16 registers in the ISA allows you to have 16 live outputs at any given "time". Speculative execution allows instructions to execute out-of-order, and to enable this, it has ~140 registers that allow it to have 140 live outputs at once, so that it can run some code while a really long load is waiting for its data.
From the ISA perspective, however, adding more registers means you have to spend more bits naming a register. With 16 registers, you need 12 bits of your instruction just to name the operands of a typical 3-address instruction (rA = rB op rC). With 128 registers, that is now a whopping 21 bits, which means code density is a more pressing issue.
I get that there is a tradeoff with code density, although if you could have an encoding scheme or extended register mode to alleviate this. I was just thinking that if you have e.g a loop where you run out of registers, since you only have 16, then the compiler will swap the values to memory and reuse that register which creates an instruction dependency that doesnt really have to exist.
If the compiler could use the hidden registers, then the cpu would know that it could run this instruction ahead of time.
It is probably not worth it, since it adds a lot of complexity to an already complex system, which is why it isn’t done.
All AMD64 CPUs support at least SSE2 which means they have 16 (not 15 or 14!) XMM registers they can spill to. This is just as fast as a move between two GPRs.
There is a difference between registers and register names.
he AMD64 architecture only has 15 general-purpose registers (because the stack pointer is mostly treated as if it were a GPR as well). It is customary to use one of those (bp/ebp/rbp depending on mode) as a base pointer register. That leaves 14 GPR register names.
The physical CPU the code runs on might have 200 physical registers -- those are the ones that matter for speculative, out-of-order execution -- but the code itself can only refer to 14 (or 15) GPRs at a time and has to include instructions to transfer values to/from memory or to/from XMM registers if that's not enough. Those extra instructions take up space + might slow the code down.
The number of registers available to the program is fixed in the instruction set. The program cannot address more registers without recompiling it to an extended instruction set.
Because then it would need more registers for the other purpose?
But actually the decision about which general purpose registers to use for what is made at compile time (hence we're discussion a compiler flag here, the frame pointer is not a hardware dictated feature), so the question is actually kind of moot. If the compiler is out of registers to allocate and instead uses the stack, the CPU isn't reasonably going to be able to undo that.
Sure, but wouldn’t it make sense to extend the instruction set to allow the compiler to use these registers instead of reserving them for speculative / out-of-order execution? It was just a thought i had after watching a talk by a compiler guy: https://youtu.be/2EWejmkKlxs?feature=shared&t=2409
jcranmer got it right. Read that reply (and mine). And then maybe rewatch watch Chandler Carruth says.
The current practice allows for CPUs to transparently increase their physical register count (to gain performance) and still run old code -- and older CPUs can still run new code. That's usually quite practical...
Adding more register names takes more bits for the register numbers -- which leads to larger instructions. It also leads to more complicated encodings if we want backwards compatibility. AMD64 does that by adding an optional prefix byte that carries a payload of 4 more instruction bits. That's one bit each for the three possible register names encoded in a traditional IA32 instruction + a bit to indicate whether to operate on 32-bit or 64-bit data (the actual rules are a bit more complex). Intel published a whitepaper recently suggesting a future encoding with a different (optional) prefix that encodes 8 more bits -- so each of the three register names can be extended to 5 bits (32 register names). It all ends up being quite complicated + new code won't run on older CPUs, which is not great.
I think you are suggesting not just bigger register names but also doing away with register renaming -- that would be... less than entirely useful because you would lose almost all your out-of-order capability and thereby almost all your ability to hide cache misses. Cache misses are very, very hard to predict statically (before actually running the code on a real CPU with real data) so good luck trying to do magic ahead-of-time allocation of those registers...
From the ISA perspective, however, adding more registers means you have to spend more bits naming a register. With 16 registers, you need 12 bits of your instruction just to name the operands of a typical 3-address instruction (rA = rB op rC). With 128 registers, that is now a whopping 21 bits, which means code density is a more pressing issue.