by ascar
1483 days ago
If you look carefully, you will find that the slower algorithm uses fewer instructions, while the faster algorithm uses 2x SIMD and more than twice as many instructions. And yes, unrolling the loop-carried dependency in the strength-reduced version will certainly make it faster, as that dependency is the only reason it's slower to begin with. I encourage you to read my other comment here: https://news.ycombinator.com/item?id=31551375

edit: to be clear, I'm only arguing about two things that the original parent questioned: whether vectorization is not free (it is, because wider ALUs require fewer instructions) and whether the second loop used more instructions (it does not, as it is unrolled by 4).
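To make the loop-carried dependency point concrete, here is a rough sketch in C. It is not the code from the article under discussion: the quadratic `a*i*i + b*i + c` and the function names `poly_direct`, `poly_reduced`, and `poly_reduced_x4` are assumptions picked for illustration. The first version recomputes every element independently, the second is strength reduced so each iteration waits on the previous one, and the third splits that single chain into four independent ones by stepping each chain 4 elements at a time.

```c
#include <stddef.h>

/* Direct form (hypothetical example): every iteration is independent of the
   others, so the compiler is free to vectorize it. */
void poly_direct(double *out, size_t n, double a, double b, double c) {
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        out[i] = a * x * x + b * x + c;
    }
}

/* Strength-reduced form: only additions, but y and d each depend on their
   value from the previous iteration (a loop-carried dependency). */
void poly_reduced(double *out, size_t n, double a, double b, double c) {
    double y = c;        /* value at i = 0 */
    double d = a + b;    /* first difference: y(1) - y(0) */
    for (size_t i = 0; i < n; i++) {
        out[i] = y;
        y += d;
        d += 2.0 * a;    /* second difference of a quadratic is constant */
    }
}

/* Strength-reduced and unrolled by 4: chain k produces elements k, k+4,
   k+8, ... so the four chains do not depend on each other. */
void poly_reduced_x4(double *out, size_t n, double a, double b, double c) {
    double y[4], d[4];
    for (int k = 0; k < 4; k++) {
        y[k] = a * k * k + b * k + c;            /* value at i = k */
        d[k] = a * (8.0 * k + 16.0) + 4.0 * b;   /* y(k + 4) - y(k) */
    }
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; k++) {
            out[i + k] = y[k];
            y[k] += d[k];
            d[k] += 32.0 * a;                    /* d(i + 4) - d(i) */
        }
    }
    /* Remaining 0..3 elements: fall back to the direct formula. */
    for (; i < n; i++) {
        double x = (double)i;
        out[i] = a * x * x + b * x + c;
    }
}
```

Whether the unrolled version actually ends up faster depends on the compiler and the CPU, so treat this as something to benchmark rather than a result. Also note that with non-integer coefficients the running sums can round differently from the direct formula.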