by gpderetta
1485 days ago
If you look carefully at the generated assembly shown in the article, the vectorized loop actually executes fewer instructions in total: each iteration is longer, but it more than makes up for that by iterating fewer times. Edit: btw, while the fast version has no loop-carried dependencies and uses 4x SIMD, it is only twice as fast. I wonder if a manually unrolled (with 4 accumulators) and vectorized version of the strength-reduced one could be faster still. The compiler won't do it without fast math.
|
And yes, unrolling the loop-carried dependency in the strength-reduced version will certainly make it faster, as that dependency is the only reason it's slower to begin with.
I encourage you to read my other comment here: https://news.ycombinator.com/item?id=31551375
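
For anyone who wants to try the 4-accumulator idea, here is a rough sketch in C. It assumes the loop in question fills an array with a quadratic, `A*i*i + B*i + C`; that form, and the function names `poly_direct` and `poly_strength_reduced_x4`, are my own placeholders and are not taken from the article. Each of the four lanes carries its own running value and running difference, so the four addition chains are independent of each other and can overlap in the pipeline (or be packed into one vector register).

```c
#include <stddef.h>

/* Reference version: recompute the polynomial from i every iteration.
   No loop-carried dependency, but multiplies on every element.
   ASSUMPTION: the quadratic form is illustrative, not from the article. */
void poly_direct(double *data, size_t n, double A, double B, double C) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        data[i] = A * x * x + B * x + C;
    }
}

/* Strength-reduced version with 4 independent accumulators (hypothetical name).
   Lane k produces f(k), f(k+4), f(k+8), ... using only additions:
     f(i+4) - f(i)        = 8*A*i + 16*A + 4*B   (stride-4 first difference)
     second difference    = 32*A                 (constant) */
void poly_strength_reduced_x4(double *data, size_t n,
                              double A, double B, double C) {
    /* y_k = f(k) for k = 0..3 */
    double y0 = C;
    double y1 = A + B + C;
    double y2 = 4 * A + 2 * B + C;
    double y3 = 9 * A + 3 * B + C;
    /* z_k = f(k+4) - f(k) = 8*A*k + 16*A + 4*B */
    double z0 = 16 * A + 4 * B;
    double z1 = 24 * A + 4 * B;
    double z2 = 32 * A + 4 * B;
    double z3 = 40 * A + 4 * B;
    const double dz = 32 * A;

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     = y0;
        data[i + 1] = y1;
        data[i + 2] = y2;
        data[i + 3] = y3;
        /* Four separate dependency chains: lane k only depends on lane k. */
        y0 += z0; y1 += z1; y2 += z2; y3 += z3;
        z0 += dz; z1 += dz; z2 += dz; z3 += dz;
    }
    /* Tail (n not a multiple of 4): fall back to direct evaluation. */
    for (; i < n; i++) {
        double x = (double)i;
        data[i] = A * x * x + B * x + C;
    }
}
```

Note that the results are not bit-identical to the single-accumulator loop, since the additions happen in a different order and round differently. That is the same reason the compiler refuses to do this transformation on its own without fast math. Whether it actually beats the multiply-based version would need to be measured.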