by lifthrasiir 1708 days ago
Well, it seems that I missed an obvious optimization using 256-bit stores:

        _mm256_store_si256((void*)(ws_out + x * 4),      _mm256_permute2x128_si256(out1, out2, 0x20));
        _mm256_store_si256((void*)(ws_out + x * 4 + 32), _mm256_permute2x128_si256(out3, out4, 0x20));
        _mm256_store_si256((void*)(ws_out + x * 4 + 64), _mm256_permute2x128_si256(out1, out2, 0x31));
        _mm256_store_si256((void*)(ws_out + x * 4 + 96), _mm256_permute2x128_si256(out3, out4, 0x31));
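For anyone puzzled by the 0x20 and 0x31 immediates: each nibble of the immediate picks one 128-bit lane for the result, so 0x20 gathers the low lanes of both inputs and 0x31 gathers the high lanes. Below is a rough scalar model of that selection in plain C, not the intrinsic itself. The names `vec256`, `make_vec`, `permute2x128_model` and `check_permute_model` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 256-bit register modelled as two 128-bit lanes (lane[0] is the low half). */
typedef struct { uint8_t lane[2][16]; } vec256;

/* Hypothetical helper: fill each lane with one repeated byte
 * so the lanes are easy to tell apart. */
static vec256 make_vec(uint8_t low, uint8_t high)
{
    vec256 v;
    memset(v.lane[0], low, 16);
    memset(v.lane[1], high, 16);
    return v;
}

/* Scalar model of _mm256_permute2x128_si256(a, b, imm).
 * Each nibble of imm describes one output lane:
 *   bits 1:0 pick the source (0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high),
 *   bit 3 zeroes the lane instead. */
static vec256 permute2x128_model(vec256 a, vec256 b, int imm)
{
    const uint8_t *src[4] = { a.lane[0], a.lane[1], b.lane[0], b.lane[1] };
    vec256 r;
    for (int i = 0; i < 2; i++) {
        int ctl = (imm >> (4 * i)) & 0xF;
        if (ctl & 8)
            memset(r.lane[i], 0, 16);
        else
            memcpy(r.lane[i], src[ctl & 3], 16);
    }
    return r;
}

/* Self-check: 0x20 gathers the two low lanes, 0x31 the two high lanes. */
void check_permute_model(void)
{
    vec256 a = make_vec(0xA0, 0xA1), b = make_vec(0xB0, 0xB1);
    vec256 lo = permute2x128_model(a, b, 0x20);
    vec256 hi = permute2x128_model(a, b, 0x31);
    assert(lo.lane[0][0] == 0xA0 && lo.lane[1][0] == 0xB0);
    assert(hi.lane[0][0] == 0xA1 && hi.lane[1][0] == 0xB1);
}
```

So each of the four stores above writes one full 32-byte vector assembled from matching halves of two inputs.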
Relative time to encode a 1 MB input (timed over 8,192 iterations):

    5.51x   original
    1.00x   bytewise (baseline)
    0.83x   SSSE3 (the original version I posted)
    0.74x   AVX2 (parent)
    0.60x   AVX2 (updated)
Unfortunately my machine (i7-7700) can't run anything beyond AVX2.

> Unfortunately my machine (i7-7700) can't run anything beyond AVX2.

Your machine has BMI2. It's not SIMD, but it handles 8 bytes at a time and is well suited to packing and unpacking the bits in this case.

https://godbolt.org/z/xcT3exenr
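To illustrate the "8 bytes at a time" point, here is a portable sketch of what the BMI2 PDEP and PEXT instructions compute, used to spread 8 bits across 8 bytes and gather them back. This is not the code behind the link above. The function names and the one-bit-per-byte mask are assumptions chosen for the example, and the loops only model the semantics (real code would use `_pdep_u64` and `_pext_u64` from `<immintrin.h>`, each a single instruction).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this example: the lowest bit of each of the 8 bytes. */
#define ONE_BIT_PER_BYTE 0x0101010101010101ULL

/* Model of PDEP: scatter the low bits of src to the set-bit positions of mask. */
static uint64_t pdep_model(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit */
        if (src & bit)
            out |= lowest;
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
}

/* Model of PEXT: gather the bits of src at the set-bit positions of mask
 * into the low bits of the result. */
static uint64_t pext_model(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest)
            out |= bit;
        mask &= mask - 1;
    }
    return out;
}

/* Unpack: bit i of the input becomes the lowest bit of byte i. */
static uint64_t unpack_bits(uint8_t bits)
{
    return pdep_model(bits, ONE_BIT_PER_BYTE);
}

/* Pack: the lowest bit of byte i becomes bit i of the result. */
static uint8_t pack_bits(uint64_t bytes)
{
    return (uint8_t)pext_model(bytes, ONE_BIT_PER_BYTE);
}

/* Self-check: packing undoes unpacking for every byte value. */
void check_bit_roundtrip(void)
{
    for (unsigned v = 0; v < 256; v++)
        assert(pack_bits(unpack_bits((uint8_t)v)) == v);
}
```

With the hardware instructions, one call handles a whole 64-bit word, which is why this can compete with SIMD code even though it is scalar.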

Ugh, you were correct. I had copied and pasted your code into my testing framework and it instantly crashed at the time, but it seems that I had used a wrong offset into the output. The resulting code was slightly faster (by 2-4%) than my AVX2 code.