Hacker News new | ask | show | jobs
by Const-me 1884 days ago
> Here is the ISPC implementation

Line 36 computes ymm6 = (ymm6 * mem) + ymm4, the next instruction on line 37 computes ymm6 = (ymm8 * mem) + ymm6

These two instructions form a dependency chain. The CPU can’t start the instruction on line 37 before the one on line 36 has made a result. That gonna take 5-6 CPU cycles depending on CPU model. Same happens for ymm5 vector between instructions on line 38 and 41, and in a few other places.

In the reference code all 4 FMA instructions in the body of the loop are independent from each other, a CPU will run all 4 of them in parallel. The data dependencies are across loop iterations, only the complete loop is limited to 4-5 cycles/iteration. That’s OK because the throughput limit (probably not the FMA throughput though, I think load ports throughput is saturated before FMA, especially for unaligned inputs) is smaller than that.

1 comments

Oh right, I didn't think of looking for that, guess you're right and doing things by hand is still better
It’s not terribly bad because CPUs are out-of-order. As far as I can tell, there’s no single dependency chain over all instructions in the loop body, some of these FMAs gonna run in parallel in your ISPC version. Still, I would expect manually-vectorized code to be slightly faster.