by rasz
1486 days ago
> The slower algo will use more instructions

The whole point of this thread is the realization of the opposite. The slower algo executes only 2 instructions in a loop, but the second one directly depends on the result of the first (which induces a pipeline stall), while the fast version brute-forces a ton of instructions at full CPU IPC.
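To make the two loop shapes concrete, here is a minimal sketch in C. It assumes the function being tabulated is a quadratic `a*i*i + b*i + c`; the function names and the polynomial are illustrative assumptions, not code quoted from the thread.

```c
#include <stddef.h>

/* Direct evaluation: more arithmetic per element, but every iteration
   is independent of the others, so the CPU (and a vectorizer) can
   overlap many of them. */
void poly_direct(double *out, size_t n, double a, double b, double c) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = a * x * x + b * x + c;
    }
}

/* Strength-reduced: only two additions per element, but both y and z
   are carried from one iteration to the next, so each iteration has to
   wait for the previous one's additions to finish. */
void poly_reduced(double *out, size_t n, double a, double b, double c) {
    double y = c;            /* value at i = 0 */
    double z = a + b;        /* first difference: f(1) - f(0) */
    const double zz = 2.0 * a; /* constant second difference */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += z;
        z += zz;
    }
}
```

With non-integer coefficients the two versions can differ slightly in the last bits, because the running sums accumulate rounding error differently from the direct formula.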
|
Edit: btw, while the fast version has no loop-carried dependencies and uses 4x SIMD, it is only twice as fast.
I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength-reduced one could be faster still.
The compiler won't do it without fast math, since reassociating floating-point additions can change the result.
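A rough sketch of what that manual unrolling might look like, in C. It assumes the function being tabulated is a quadratic `a*i*i + b*i + c`; the name `poly_reduced4` and the polynomial are hypothetical, chosen only for illustration. Each of the 4 chains covers every fourth index, so its first difference is `f(i+4) - f(i) = 8*a*i + 16*a + 4*b` and its second difference is `32*a`. Whether this actually beats the direct version would have to be measured.

```c
#include <stddef.h>

/* Strength-reduced evaluation with 4 independent accumulator chains.
   Chain k produces the values at indices k, k+4, k+8, ... so the four
   chains do not depend on each other and can run in parallel or be
   mapped onto SIMD lanes. */
void poly_reduced4(double *out, size_t n, double a, double b, double c) {
    double y[4], z[4];
    for (int k = 0; k < 4; k++) {
        double x = (double)k;
        y[k] = a * x * x + b * x + c;            /* f(k) */
        z[k] = 8.0 * a * x + 16.0 * a + 4.0 * b; /* f(k+4) - f(k) */
    }
    const double zz = 32.0 * a; /* second difference for a stride of 4 */

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            out[i + k] = y[k];
            y[k] += z[k];
            z[k] += zz;
        }
    }
    /* Tail: y[k] already holds the value for index i + k. */
    for (int k = 0; i + (size_t)k < n; k++) {
        out[i + k] = y[k];
    }
}
```

Because the additions are grouped differently than in the single-chain loop, the low-order bits of the results can differ for non-integer coefficients, which is the same reason the compiler does not make this transformation on its own.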