by BeeOnRope 2146 days ago
I wouldn't characterize this as "perfect" for SIMD!

Perfect for SIMD usually means a significant amount of calculation that can be done vector-wise (you could include contiguous data movement in that definition).

Here, you are doing exactly one (cheap) calculation: the compare, and one vectorized load, and you want to feed the results to a branch, presumably.

You are only saving a few instructions versus scalar, and you pay a vector-to-GP penalty.

1 comments

The penalty is quite small, 1-3 cycles each direction. RAM latency is 1-2 orders of magnitude more than that, and even the L1D cache is many cycles away. Replacing multiple scalar RAM loads with 1 vector load is usually a good idea performance-wise. This is true even if you’ll then use extract instructions to access the lanes. Extract latency is 2-3 cycles, much faster than RAM.

I think what might have happened, GP tried to use SSE for dealing with individual lanes. Better approach for that use case is moving the comparison results to scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.
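A rough sketch of that approach in C with SSE2 intrinsics, assuming four 32-bit integer lanes and an x86-64 target. The names `match_mask4` and `find4` are made up for illustration and do not come from the code being discussed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>

/* Hypothetical helper: 4-bit mask with bit i set when v[i] == key. */
static inline int match_mask4(const int32_t *v, int32_t key)
{
    __m128i data   = _mm_loadu_si128((const __m128i *)v); /* one 128-bit load */
    __m128i needle = _mm_set1_epi32(key);                 /* GP -> vector move plus broadcast */
    __m128i eq     = _mm_cmpeq_epi32(data, needle);       /* all-ones in every matching lane */
    /* movmskps: gather the top bit of each 32-bit lane into one GP register */
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}

/* Hypothetical helper: index of the first matching lane, or -1 if none. */
static inline int find4(const int32_t *v, int32_t key)
{
    int mask = match_mask4(v, key);
    for (int i = 0; i < 4; i++) {
        if (mask & (1 << i)) {
            return i;
        }
    }
    return -1;
}
```

The whole vector crosses to the scalar side once, so a single `if (match_mask4(v, key))` can drive the branch instead of one test per lane.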

Yes, the penalty is small, but the total amount of vectorized work is also very small!

L1D is not many cycles away: it is 4 or 5 for scalar loads, 6 or 7 for xmm or ymm loads. If the load misses, it doesn't much matter if it's a scalar or vector load: the time to fetch the cache line is the same.

So a scalar load of 5 cycles looks much better, latency-wise, than a vector load of 6 cycles, plus an extract of 1-3 cycles.

Of course, you need only 1 vector load vs 4 GP loads, but the latencies are overlapped.

Furthermore, the extracts can happen on only a single port: so even though you have 512 bits/cycle of "contiguous" vector load bandwidth, you then suck those loads through a 32 bits/cycle extract straw [1]. 32-bit GP loads have 64 bits/cycle of bandwidth, and the load puts the value directly in a GP register, or can even be micro-fused with the ALU op.

So no, it is not an obvious win to load 4x32-bit values with a vector load and then bring them over to GP registers. Even if it might sometimes be slightly better, this is hardly "perfect" for vectorization; rather, I'd say it is a "quite poor candidate for vectorization".
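For concreteness, the two shapes being compared look roughly like this in C. This is only a sketch: the function names are invented, SSE2 on x86-64 is assumed, and an optimizing compiler is free to turn either version into something else entirely.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar version: four independent 32-bit loads, each feeding a compare. */
static int count_eq_scalar(const int32_t *v, int32_t key)
{
    return (v[0] == key) + (v[1] == key) + (v[2] == key) + (v[3] == key);
}

/* Vector load, then every lane is moved to a GP register one at a time. */
static int count_eq_extract(const int32_t *v, int32_t key)
{
    __m128i data = _mm_loadu_si128((const __m128i *)v);      /* one 128-bit load */
    int32_t a = _mm_cvtsi128_si32(data);                     /* movd: lane 0 */
    int32_t b = _mm_cvtsi128_si32(_mm_srli_si128(data, 4));  /* byte shift, then movd: lane 1 */
    int32_t c = _mm_cvtsi128_si32(_mm_srli_si128(data, 8));  /* lane 2 */
    int32_t d = _mm_cvtsi128_si32(_mm_srli_si128(data, 12)); /* lane 3 */
    return (a == key) + (b == key) + (c == key) + (d == key);
}
```

With SSE4.1, `pextrd` would replace each shift-plus-`movd` pair, but every lane still needs its own vector-to-GP move before the scalar compare can run.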

Also, if the goal is to set a flag and jump on it, you'll still end up needing a scalar comparison anyway, so for the computation part there are actually no savings.

Don't forget the thing you are comparing to: presumably it starts in a GP register, so you need some kind of GP->SIMD move and then a broadcast to prepare the comparison.

> I think what might have happened, GP tried to use SSE for dealing with individual lanes. Better approach for that use case is moving the comparison results to scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.

Right, well, who knows what they tried to do or how the surrounding code works? I agree the approach you suggest sounds like it should be a slight win sometimes, but the key word is "slight". If the surrounding code is general purpose code and the inputs and outputs come from and go to GP registers, this is just "too small" to vectorize well. It's a common misconception that, say, comparing values is the bulk of the work, so of course vectorization will be a 4x win, but actually all the surrounding stuff takes most of the work, much more than a comparison, which can execute 4 per cycle on the scalar side.

---

[1] You can try other tricks like extracting 64 bits and then messing around in the GP reg to split the halves, but it's basically a wash.