Yes, consider vector floating-point fused multiply-add (FMA). On a typical AVX implementation like Skylake, this instruction has a latency of 4 cycles and a throughput of 2 instructions per cycle. To keep the FMA units busy without stalls, you need 4 * 2 = 8 independent FMAs in flight, and therefore 8 architectural registers just to hold the results. You could spill the results to the stack and reuse the same architectural registers, but you usually want each value again in the very next loop iteration (e.g. matrix multiply), so the extra stores and loads would be expensive. You probably want a few more architectural registers (at the very least 2, up to 16) to hold the inputs as well.
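
To make the 4 * 2 = 8 arithmetic concrete, here is a rough sketch in plain C of a dot product that keeps 8 independent accumulators. The name `dot_unrolled` and the constants are made up for illustration, and the latency/throughput figures are simply the Skylake numbers quoted above. In real vectorized code each accumulator would be a whole vector register rather than a scalar; whether a compiler actually maps this loop onto FMA instructions depends on flags and target, so treat it as an illustration of the dependency structure, not a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Skylake figures quoted in the comment above; assumptions on other cores. */
enum {
    FMA_LATENCY   = 4,                          /* cycles until a result is usable */
    FMA_PER_CYCLE = 2,                          /* FMAs that can issue per cycle   */
    NUM_ACC       = FMA_LATENCY * FMA_PER_CYCLE /* independent chains needed: 8    */
};

/* Hypothetical helper: dot product with NUM_ACC independent accumulators. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    float acc[NUM_ACC] = {0};
    size_t i = 0;

    /* Main loop: each acc[k] is its own dependency chain, so the 8
       multiply-adds in one iteration never wait on each other. */
    for (; i + NUM_ACC <= n; i += NUM_ACC)
        for (size_t k = 0; k < NUM_ACC; ++k)
            acc[k] += a[i + k] * b[i + k];

    /* Tail: leftover elements go into the first accumulator. */
    for (; i < n; ++i)
        acc[0] += a[i] * b[i];

    /* Combine the partial sums once, outside the hot loop. */
    float sum = 0.0f;
    for (size_t k = 0; k < NUM_ACC; ++k)
        sum += acc[k];
    return sum;
}
```

With a single accumulator, every multiply-add would have to wait the full 4 cycles for the previous one. Note that splitting the sum this way changes the order of floating-point additions, so the result can differ in the last bits from a strictly sequential sum.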