by BeeOnRope
2146 days ago
I wouldn't characterize this as "perfect" for SIMD! Perfect for SIMD usually means a significant amount of calculation that can be done vector-wise (you could include contiguous data movement in that definition). Here you are doing exactly one (cheap) calculation, the compare, plus one vectorized load, and presumably you then want to feed the result to a branch. You save only a few instructions versus scalar code, and you pay a penalty for moving the result from a vector register to a general-purpose (GP) register.
|
I think what might have happened is that the grandparent tried to use SSE to deal with individual lanes. A better approach for that use case is to move the comparison results into a scalar register with a single movmskps or pmovmskb, or to test them with ptest (SSE4.1, which sets the flags directly), just once for the complete vector.
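
To make that concrete, here is a minimal sketch of the "one compare, one movemask, one branch" pattern using SSE2 intrinsics. The function name `find_byte16` and the byte-search task are my own assumptions for illustration and are not from the thread, and `__builtin_ctz` assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns the index of the first byte equal to
 * `needle` in a 16-byte block, or -1 if there is none. */
static int find_byte16(const uint8_t *block, uint8_t needle)
{
    /* One unaligned vector load of all 16 bytes. */
    __m128i data = _mm_loadu_si128((const __m128i *)block);

    /* One vector compare: each matching lane becomes 0xFF, others 0x00. */
    __m128i match = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)needle));

    /* pmovmskb: collect the top bit of every lane into a 16-bit scalar
     * mask. This is the single vector-to-GP transfer for the whole vector. */
    int mask = _mm_movemask_epi8(match);

    /* One scalar branch on the combined result instead of one per lane. */
    if (mask == 0)
        return -1;

    /* Lowest set bit corresponds to the first matching byte
     * (GCC/Clang builtin, assumed available). */
    return __builtin_ctz((unsigned)mask);
}
```

The point is that the lanes are never extracted one by one: the mask crosses from the vector domain to a GP register once, and everything after that is ordinary scalar code.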