by nx7487 2030 days ago
Oh: to clarify, you mean that the compiler could just use stack slots for everything, but some instructions are only allowed to operate on architectural registers, right? If you have to execute a lot of those instructions, the number of architectural registers can be the bottleneck in performance?

Yes, consider vector floating point fused multiply-add (FMA). On a typical AVX implementation like Skylake, this instruction has a latency of 4 cycles and a throughput of 2 instructions per cycle. To avoid stalls, you need 4 * 2 = 8 independent instructions in flight, and 8 architectural registers just to hold their results. You could spill the results to the stack and reuse the same architectural registers, but usually you want to use the values immediately in the next loop iteration (e.g. matrix multiply), so this would be expensive. You probably want a few more architectural registers (at the very least 2, up to 16) to hold the inputs as well.
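To make the 4 * 2 = 8 arithmetic concrete, here is a rough sketch in plain C of a dot product that keeps 8 independent accumulator chains. The names (`dot_chains`, `CHAINS`, `FMA_LATENCY`, `FMA_PER_CYCLE`) are made up for illustration, and the latency/throughput numbers are just the Skylake figures quoted above. Whether the compiler actually keeps each accumulator in a register and emits FMA instructions depends on the target and flags, so treat this as an illustration of the dependency structure, not a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed figures taken from the comment above (Skylake FMA). */
enum {
    FMA_LATENCY   = 4,                          /* cycles until a result is usable */
    FMA_PER_CYCLE = 2,                          /* FMAs that can issue each cycle  */
    CHAINS        = FMA_LATENCY * FMA_PER_CYCLE /* independent chains needed       */
};

static_assert(CHAINS == 8, "4 cycles * 2 per cycle = 8 chains");

/* Hypothetical helper: dot product with CHAINS independent accumulators.
 * Each acc[k] depends only on its own previous value, so the loop-carried
 * dependency is split into CHAINS separate chains. */
double dot_chains(const double *a, const double *b, size_t n)
{
    double acc[CHAINS] = {0};
    size_t i = 0;

    /* Main loop: one multiply-add per chain per iteration. */
    for (; i + CHAINS <= n; i += CHAINS) {
        for (int k = 0; k < CHAINS; k++) {
            acc[k] += a[i + k] * b[i + k];
        }
    }

    /* Combine the partial sums once, outside the hot loop. */
    double sum = 0.0;
    for (int k = 0; k < CHAINS; k++) {
        sum += acc[k];
    }

    /* Leftover elements when n is not a multiple of CHAINS. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

With a single accumulator, every multiply-add has to wait for the previous one to finish, so the loop runs at one result per latency period. Splitting the sum into 8 chains is what creates the demand for 8 live result registers. Note that this changes the order of the floating point additions, so the result can differ slightly from a strictly sequential sum.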
Exactly this. Thanks for the practical example.

The reason it is less relevant for integer computations is that integer ops normally have lower latency and tend to have shorter loop-carried dependency chains.