by btdmaster 1479 days ago
On my machine, your code is faster for smaller LEN values. I'm not sure why, though.
1 comment

8x 64-bit is 512 bits, which matches the AVX-512 register width. You'll probably need AVX-512 to fully benefit from unrolling x8.

4x 64-bit is 256 bits, which matches the AVX2 register width. That requires special compiler flags (e.g. -mavx2 for GCC, /arch:AVX2 for Visual Studio), but most x86 CPUs should support AVX2 these days.

2x 64-bit is 128 bits, which fits in the 128-bit SSE registers that GCC / Visual Studio target with default compiler flags on x86-64.
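
To make the width arithmetic concrete, here's a minimal sketch of a x4 unroll. The kernel (a plain sum) and the names sum_unroll4 / sum_scalar are my own hypothetical stand-ins, not the code from the parent comment. Four independent 64-bit accumulators add up to 256 bits, so they could map onto one AVX2 register, but whether the compiler actually vectorizes this depends on the optimizer and the flags you pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: one accumulator, no unrolling. */
uint64_t sum_scalar(const uint64_t *a, size_t len)
{
    uint64_t s = 0;
    for (size_t i = 0; i < len; ++i)
        s += a[i];
    return s;
}

/* Hypothetical x4 unroll: four independent 64-bit accumulators
   (4 x 64 = 256 bits, the width of one AVX2 register).
   Assumption: the real kernel in the thread may look different. */
uint64_t sum_unroll4(const uint64_t *a, size_t len)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: consume four elements per iteration. */
    for (; i + 4 <= len; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    uint64_t s = s0 + s1 + s2 + s3;

    /* Tail loop: handle the 0 to 3 leftover elements. */
    for (; i < len; ++i)
        s += a[i];

    return s;
}
```

Building the same file with something like gcc -O3 (SSE only) and gcc -O3 -mavx2 and comparing timings across LEN values would be one way to check whether the wider registers are actually being used on your machine.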