|
|
|
|
|
by souprock
2147 days ago
|
|
That is a common problem. SIMD can be slower than non-SIMD. Consider this problem: The goal was to emulate a 4-way PowerPC TLB on x86-64. Four uint32_t values had to be compared to find a match. The data structure was roughly "uint32_t array[512][4][4]", laid out so that the 4 uint32_t values would be adjacent. It simply didn't perform OK. Getting the equality test results out of SIMD was lengthy, awkward, and slow. That task was so perfect for SIMD, and yet SIMD failed at it. The data was the exactly correct size of an SSE XMM register. It was aligned. The task was a simple parallel operation. |
|
1. If you don’t have AVX, a good way to broadcast integer from scalar register to vector is _mm_cvtsi32_si128 followed by _mm_shuffle_epi32( v, 0 )
2. To compare them for equality, _mm_cmpeq_epi32
3. Getting index of the first match is 2 instructions, MOVMSKPS and BSF.
Getting compiler to emit them is a bit awkward, though. You first need _mm_castsi128_ps to be able to call _mm_movemask_ps. Test the integer for 0 afterwards, if zero, none of the 4 lanes were equal.
The portable way to emit BSF is only introduced in C++/20. In the current version of the language you have to use preprocessor to detect compiler, use _BitScanForward for msvc, __builtin_ctz for gcc/clang.
If you want count of matches, replace BSF with POPCNT. Again, in current version of the language it’s compiler specific, __popcnt for msvc, __builtin_popcount for gcc/clang.
P.S. If you only need a single boolean saying if none of the 4 lanes matched / any of the lanes matched, use _mm_test_all_zeros / _mm_test_mix_ones_zeros instead of _mm_movemask_ps. Or if you want to test more than 1 cache entry, leave the comparison result in a vector register, compare more entries, combine results with bitwise instructions.
Update: If you don’t need index or count of matches but want to individually test all 4 matches with scalar code, on old CPUs _mm_movemask_epi8 is slightly faster because cross-domain latency, test the result for bits 1, 0x10, 0x100, 0x1000.