Hacker News new | ask | show | jobs
by Const-me 2147 days ago
Based on your description, here’s what you should do to vectorize that code.

1. If you don’t have AVX, a good way to broadcast integer from scalar register to vector is _mm_cvtsi32_si128 followed by _mm_shuffle_epi32( v, 0 )

2. To compare them for equality, _mm_cmpeq_epi32

3. Getting index of the first match is 2 instructions, MOVMSKPS and BSF.

Getting compiler to emit them is a bit awkward, though. You first need _mm_castsi128_ps to be able to call _mm_movemask_ps. Test the integer for 0 afterwards, if zero, none of the 4 lanes were equal.

The portable way to emit BSF is only introduced in C++/20. In the current version of the language you have to use preprocessor to detect compiler, use _BitScanForward for msvc, __builtin_ctz for gcc/clang.

If you want count of matches, replace BSF with POPCNT. Again, in current version of the language it’s compiler specific, __popcnt for msvc, __builtin_popcount for gcc/clang.

P.S. If you only need a single boolean saying if none of the 4 lanes matched / any of the lanes matched, use _mm_test_all_zeros / _mm_test_mix_ones_zeros instead of _mm_movemask_ps. Or if you want to test more than 1 cache entry, leave the comparison result in a vector register, compare more entries, combine results with bitwise instructions.

Update: If you don’t need index or count of matches but want to individually test all 4 matches with scalar code, on old CPUs _mm_movemask_epi8 is slightly faster because cross-domain latency, test the result for bits 1, 0x10, 0x100, 0x1000.