

by nemequ1729 2145 days ago

I'm the lead developer of SIMDe.

I wouldn't say that portability is the main focus. The first step is to get portable implementations up and running, but a huge number of functions have optimized implementations for NEON, AltiVec/VSX, and WASM SIMD 128, and we're working on adding more. We go to a lot of trouble to get good performance on multiple architectures, basically writing each implementation several times and using ifdefs to switch depending on what the fastest version available to a given architecture will be.

Even just for the portable implementations, we use a lot of hints to help the compiler auto-vectorize. Almost every portable implementation has a loop which uses a pragma to try to get the compiler to do the right thing (OpenMP SIMD, clang loop-specific pragmas, GCC ivdep, etc.). On top of that we take advantage of lots of compiler-specific features to speed things up where possible, including GCC-style vector extensions, __builtin_shuffle/__builtin_shufflevector, and __builtin_convertvector.
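
To illustrate the kind of hinted loop described above, here is a minimal sketch of a portable 4 x 32-bit add. It is not SIMDe's actual code: the names `SKETCH_VECTORIZE`, `sketch_i32x4`, and `sketch_add_i32x4` are hypothetical, and the macro simply picks whichever loop hint the compiler is likely to understand.

```c
#include <stdint.h>

/* Hypothetical macro (not from SIMDe): choose a vectorization hint. */
#if defined(_OPENMP)
#  define SKETCH_VECTORIZE _Pragma("omp simd")
#elif defined(__clang__)
#  define SKETCH_VECTORIZE _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
#  define SKETCH_VECTORIZE _Pragma("GCC ivdep")
#else
#  define SKETCH_VECTORIZE /* no hint available */
#endif

/* Hypothetical portable stand-in for a 128-bit vector of four int32 lanes. */
typedef struct { int32_t lane[4]; } sketch_i32x4;

/* Lane-wise add with wrap-around on overflow. */
sketch_i32x4 sketch_add_i32x4(sketch_i32x4 a, sketch_i32x4 b) {
    sketch_i32x4 r;
    SKETCH_VECTORIZE
    for (int i = 0; i < 4; i++) {
        /* Add as unsigned to avoid signed-overflow UB; the cast back to
           int32_t wraps on GCC and clang (implementation-defined in C). */
        r.lane[i] = (int32_t)((uint32_t)a.lane[i] + (uint32_t)b.lane[i]);
    }
    return r;
}
```

With optimization enabled, a compiler that honors the hint can usually turn the loop into a single vector add on targets that have one; whether it does is up to the compiler.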

SIMDe is never going to be as fast as an optimized implementation written for a given target by someone who knows what they're doing. However, it should be as fast as (or faster than) a direct port by someone who is just trying to match the existing code as closely as possible.


innocenat 2145 days ago

I don't have experience with SIMD on platforms other than x86/amd64, but a few data-shuffling functions [0] have SIMD versions that are not much faster than the scalar C implementation, and the overhead of translation might make them slower.

[0]: http://web.archive.org/web/20140807014206/http://x264dev.mul...


nemequ1729 2145 days ago

I hadn't seen that post before, thanks.

This doesn't quite apply to SIMDe; the problem that post is talking about is really at a higher level… whether it is faster to do a bunch of shuffles or use some scalar code. Once SIMDe is called you've already made your decision, and at that level the hardware-based shuffles are much faster than scalar code. For example, see the decompression speed benchmarks for LZSSE-SIMDe (<https://github.com/nemequ/LZSSE-SIMDe>) (they're in the README).

It sounds like what that post really needs is a fast 16-bit gather operation. AVX2 has some 32-bit gather functions which may be usable (2 gathers + a blend could emulate 16-bit gathers). For NEON, you could probably use one of the `vtbl` functions; they're all 8-bit, but that just means you have separate index entries for high and low bytes… it's a bit more code, but there shouldn't be any runtime overhead.


souprock 2144 days ago

That is a common problem. SIMD can be slower than non-SIMD. Consider this problem:

The goal was to emulate a 4-way PowerPC TLB on x86-64. Four uint32_t values had to be compared to find a match. The data structure was roughly "uint32_t array[512][4][4]", laid out so that the 4 uint32_t values would be adjacent.

It simply didn't perform OK. Getting the equality test results out of SIMD was lengthy, awkward, and slow.

That task was so perfect for SIMD, and yet SIMD failed at it. The data was exactly the size of an SSE XMM register. It was aligned. The task was a simple parallel operation.
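
For reference, a scalar version of the lookup described above might look like the sketch below. The meaning of the three array dimensions is an assumption (set, field, way, with tags in field 0); the comment only gives the shape `uint32_t array[512][4][4]`. All `sketch_` names are hypothetical.

```c
#include <stdint.h>

enum { SKETCH_SETS = 512, SKETCH_FIELDS = 4, SKETCH_WAYS = 4 };

/* Assumed layout: [set][field][way], so the four way values of one
   field are adjacent in memory (16 bytes). Field 0 is assumed to
   hold the tags that are compared. */
typedef uint32_t sketch_tlb[SKETCH_SETS][SKETCH_FIELDS][SKETCH_WAYS];

/* Scalar 4-way compare: returns the matching way (0..3), or -1. */
int sketch_tlb_find_scalar(const uint32_t tags[SKETCH_WAYS], uint32_t key) {
    for (int way = 0; way < SKETCH_WAYS; way++) {
        if (tags[way] == key) {
            return way;
        }
    }
    return -1;
}
```

A caller would pass one row of tags, for example `sketch_tlb_find_scalar(tlb[set][0], key)`.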


Const-me 2144 days ago

Based on your description, here’s what you should do to vectorize that code.

1. If you don’t have AVX, a good way to broadcast integer from scalar register to vector is _mm_cvtsi32_si128 followed by _mm_shuffle_epi32( v, 0 )

2. To compare them for equality, _mm_cmpeq_epi32

3. Getting index of the first match is 2 instructions, MOVMSKPS and BSF.

Getting the compiler to emit them is a bit awkward, though. You first need _mm_castsi128_ps to be able to call _mm_movemask_ps. Test the resulting integer for 0 afterwards; if it is zero, none of the 4 lanes were equal.

The portable way to emit BSF was only introduced in C++20. In the current version of the language you have to use the preprocessor to detect the compiler, then use _BitScanForward for msvc and __builtin_ctz for gcc/clang.

If you want count of matches, replace BSF with POPCNT. Again, in current version of the language it’s compiler specific, __popcnt for msvc, __builtin_popcount for gcc/clang.

P.S. If you only need a single boolean saying if none of the 4 lanes matched / any of the lanes matched, use _mm_test_all_zeros / _mm_test_mix_ones_zeros instead of _mm_movemask_ps. Or if you want to test more than 1 cache entry, leave the comparison result in a vector register, compare more entries, combine results with bitwise instructions.

Update: If you don’t need the index or count of matches but want to individually test all 4 matches with scalar code, on old CPUs _mm_movemask_epi8 is slightly faster because of cross-domain latency; test the result for bits 1, 0x10, 0x100, 0x1000.
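
Steps 1 to 3 above can be put together roughly as follows. This is a sketch, not code from the thread: the `sketch_` names are hypothetical, an unaligned load is used so it works on any input (the original data was described as aligned, so `_mm_load_si128` would also do there), and a scalar fallback is included so the file still builds where SSE2 is unavailable.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h> /* SSE2 intrinsics */
#  define SKETCH_HAVE_SSE2 1
#endif
#if defined(_MSC_VER)
#  include <intrin.h>    /* _BitScanForward */
#endif

/* Returns a 4-bit mask: bit i is set when tags[i] == key. */
int sketch_match_mask(const uint32_t *tags, uint32_t key) {
#if defined(SKETCH_HAVE_SSE2)
    /* Step 1: move the scalar into a vector and broadcast it. */
    __m128i k = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)key), 0);
    __m128i t = _mm_loadu_si128((const __m128i *)tags);
    /* Step 2: lane-wise equality, all-ones in matching lanes. */
    __m128i eq = _mm_cmpeq_epi32(t, k);
    /* Step 3 (first half): MOVMSKPS collects one sign bit per lane. */
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
#else
    /* Scalar fallback with the same result. */
    int mask = 0;
    for (int i = 0; i < 4; i++) {
        mask |= (tags[i] == key) << i;
    }
    return mask;
#endif
}

/* Returns the index of the first matching lane, or -1 if none. */
int sketch_first_match(const uint32_t *tags, uint32_t key) {
    int mask = sketch_match_mask(tags, key);
    if (mask == 0) {
        return -1; /* bit scan is undefined for 0, so test first */
    }
#if defined(_MSC_VER) && !defined(__clang__)
    unsigned long idx;
    _BitScanForward(&idx, (unsigned long)mask);
    return (int)idx;
#else
    return __builtin_ctz((unsigned)mask);
#endif
}
```

For tags `{10, 20, 30, 20}` and key `20`, the mask has bits 1 and 3 set and the first match is lane 1.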


BeeOnRope 2144 days ago

I wouldn't characterize this as "perfect" for SIMD!

Perfect for SIMD usually means a significant amount of calculation that can be done vector-wise (you could include contiguous data movement in that definition).

Here, you are doing exactly one (cheap) calculation: the compare, and one vectorized load, and you want to feed the results to a branch, presumably.

You are only saving a few instructions versus scalar and pay a vector to GP penalty.


Const-me 2144 days ago

The penalty is quite small, 1-3 cycles each direction. RAM latency is 1-2 orders of magnitude more than that, even L1D level of cache is many cycles away. Replacing multiple scalar RAM loads with 1 vector load is usually a good idea performance wise. This is true even if you’ll then use extract instructions to access the lanes. Extract latency is 2-3 cycles, much faster than RAM.

I think what might have happened is that GP tried to use SSE for dealing with individual lanes. A better approach for that use case is moving the comparison results to a scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.


BeeOnRope 2144 days ago

Yes, the penalty is small, but the total amount of vectorized work is also very small!

L1D is not many cycles away: it is 4 or 5 for scalar loads, 6 or 7 for xmm or ymm loads. If the load misses, it doesn't much matter if it's a scalar or vector load: the time to fetch the cache line is the same.

So a scalar load of 5 cycles looks much better, latency-wise, than a vector load of 6 cycles, plus an extract of 1-3 cycles.

Of course, you need only 1 vector load vs 4 GP loads, but the latencies are overlapped.

Furthermore, the extracts can happen on a single port: so even though you have 512 bits/cycle of "contiguous" vector load bandwidth, you then suck those loads through a 32 bits/cycle extract straw [1]? 32-bit GP loads have 64 bits/cycle bandwidth and the value goes directly to the GP register, or is even micro-fused with the ALU op.

So no, it is not an obvious win to load 4x32-bit values with a vector load and then bring them over to GP registers. Even if it might sometimes be slightly better, this is hardly "perfect" for vectorization; rather, I'd say it is "quite a poor candidate for vectorization".

Also, if the goal is to set a flag and jump on it, you'll still end up needing a scalar comparison anyway, so actually for the computation part there is no savings.

Don't forget the thing you are comparing to: presumably it starts in a GP register, so you need some kind of GP->SIMD move and then a broadcast to prepare the comparison.

> I think what might have happened is that GP tried to use SSE for dealing with individual lanes. A better approach for that use case is moving the comparison results to a scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.

Right, well, who knows what they tried to do or how the surrounding code works. I agree the approach you suggest sounds like it should be a slight win sometimes, but the key word is "slight". If the surrounding code is general purpose code and the inputs and outputs come from and go to GP registers, this is just "too small" to vectorize well. It's a common misconception that, say, comparing values is the bulk of the work, so of course vectorization will be a 4x win, but actually all the surrounding stuff takes most of the work, much more than a comparison which can execute 4 per cycle on the scalar side.

---

[1] You can try other tricks like extracting 64-bits and then messing around in the GP reg to split the halves, but it's basically a wash.
