by easde 2033 days ago
Yes, consider vector floating-point fused multiply-add (FMA). On a typical AVX2/FMA implementation like Skylake, this instruction has a latency of 4 cycles and a throughput of 2 instructions per cycle. To avoid stalls, you need 4 * 2 = 8 independent instructions in flight, and so 8 architectural registers just to hold the results. You could spill the results to the stack and reuse the same architectural registers, but usually you want to use the values immediately in the next loop iteration (e.g. as accumulators in a matrix multiply), so this would be expensive. You probably also want a few more architectural registers (at the very least 2, up to 16) to hold the inputs.
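
To make the 4 * 2 = 8 arithmetic concrete, here is a rough scalar sketch of a dot product written both ways. The function names `dot_1acc` and `dot_8acc` are made up for illustration, and each scalar accumulator stands in for what would be a whole vector register in real AVX code. Whether the compiler actually emits FMA instructions and keeps all eight accumulators in registers depends on the target and flags, so treat this as an illustration of the dependency structure, not a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every multiply-add depends on the previous one,
   so the loop is limited by the latency of the operation. */
float dot_1acc(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Eight accumulators: eight independent dependency chains, matching
   latency (4 cycles) * throughput (2 per cycle) from the comment above. */
float dot_8acc(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    float acc4 = 0.0f, acc5 = 0.0f, acc6 = 0.0f, acc7 = 0.0f;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* None of these eight updates depends on another one. */
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
        acc4 += a[i + 4] * b[i + 4];
        acc5 += a[i + 5] * b[i + 5];
        acc6 += a[i + 6] * b[i + 6];
        acc7 += a[i + 7] * b[i + 7];
    }
    /* Combine the partial sums, then handle the leftover elements. */
    float sum = ((acc0 + acc1) + (acc2 + acc3)) + ((acc4 + acc5) + (acc6 + acc7));
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that the two versions add the products in a different order, so for general inputs the results can differ by floating-point rounding. The eight accumulators are exactly the "8 architectural registers to hold the results"; the loads of `a` and `b` are where the extra input registers go.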
1 comments

Exactly this. Thanks for the practical example.

The reason it is less relevant for integer computations is that integer ops normally have lower latency and tend to have shorter loop-carried dependency chains.