Hacker News
by varajelle 1485 days ago
The problem here is that the additions depend on values computed in the previous iteration of the loop. The version with multiplication is faster because there are no dependencies on the previous iteration, so the CPU has more freedom in scheduling the operations.
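
To make the difference concrete, here is a rough sketch. The loop being discussed isn't shown in this thread, so this assumes it evaluates a quadratic a*i*i + b*i + c for each index; the function names fill_mul and fill_add are made up for illustration. The first version recomputes every element from i alone, while the second carries y and z from one iteration into the next.

```c
#include <stddef.h>

/* Independent iterations: each element is computed from i alone,
   so iteration i+1 does not have to wait for iteration i. */
void fill_mul(double *out, size_t n, double a, double b, double c) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = a * x * x + b * x + c;
    }
}

/* Loop-carried dependency: y and z are produced by the previous
   iteration, so the additions form one long serial chain. */
void fill_add(double *out, size_t n, double a, double b, double c) {
    double y = c;           /* polynomial value at i = 0 */
    double z = a + b;       /* first difference f(1) - f(0) */
    double step = 2.0 * a;  /* second difference, constant for a quadratic */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += z;     /* needs y and z from the previous iteration */
        z += step;  /* needs z from the previous iteration */
    }
}
```

Both functions produce the same values for inputs that are exactly representable, but only fill_add forces each iteration to wait on the one before it.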

The power consumption is a good question.

1 comment

> Scheduling plays a part, but it is definitely more about vectorization.
It's almost certainly more about scheduling than vectorization. The data dependencies are going to constantly stall the CPU pipeline, so it's just not able to retire instructions very quickly. The SIMD part is almost certainly a red herring. It's helping, but it's far from why it's so much faster. Tiger Lake can retire 4 plain ol' ADD operations per clock[1], so you don't need SIMD / vectorization to get instruction-level parallelism. But you do need to ensure there are no data dependencies. The data dependency here is 90% of the cost. The SIMD is just the cherry on top.
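
As an illustration of instruction-level parallelism without SIMD, here is a minimal scalar sketch. It is not the code under discussion, and sum_chained and sum_split are hypothetical names. The first function funnels every add through one accumulator, so each add waits on the previous one; the second splits the work across four independent accumulators that the CPU is free to overlap. An optimizing compiler may well unroll or vectorize either loop on its own, so treat this as a picture of the dependency structure rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Single accumulator: every add depends on the result of the previous add. */
uint64_t sum_chained(const uint32_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Four accumulators: four independent dependency chains, no SIMD involved. */
uint64_t sum_split(const uint32_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += v[i];
    }
    return s0 + s1 + s2 + s3;
}
```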

[1] https://www.agner.org/optimize/instruction_tables.pdf