by RaisingSpear 849 days ago
> There are some times when you're doing these operations in large vector types (like AVX), in parallel, when falling back to the SWAR techniques can actually be faster because of limited bandwidth on the port for micro ops in popcnt or clz.

I'm doubtful of this in general. Some older processors did have a very slow BSF/BSR implementation, so it may be the case there, but I wouldn't generally stick with that assumption these days. (And it's most certainly not the case on AMD Zen, given that all its ALUs can do these bit ops.)

AVX doesn't have lzcnt prior to AVX-512 CD, so it could make more sense there, though I'd imagine that abusing float conversion (convert to float, extract exponent) would still be faster.
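
For anyone unfamiliar with the float-conversion trick, here is a minimal scalar sketch in C of the idea, not a tuned implementation. The function name `lzcnt32_via_float` is made up for illustration. A vector version would do the same steps per lane, but note that the AVX2 integer-to-float conversion is signed, so lanes with the top bit set would need separate handling there.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count leading zeros of a 32-bit value by
   converting it to float and reading the biased exponent. */
static inline uint32_t lzcnt32_via_float(uint32_t x)
{
    if (x == 0)
        return 32;              /* 0.0f carries no useful exponent */

    /* Keep the highest set bit but clear the bit just below it, so the
       conversion can never round up into the next power of two
       (e.g. 0x01FFFFFF would otherwise round to 0x02000000). */
    x &= ~(x >> 1);

    float f = (float)x;         /* exponent is now floor(log2(x)) */

    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined type pun */

    /* IEEE-754 single precision: exponent in bits 23..30, bias 127.
       The sign bit is 0 because x > 0. */
    uint32_t exponent = (bits >> 23) - 127;

    return 31 - exponent;
}
```

The masking step is the part that's easy to forget: without it, any input with more than 24 significant bits can round upward and come back with a count that's one too small.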

1 comment

Oh, I don't think it's a general principle; it's very processor-architecture specific. But here's one example from earlier AVX-supporting CPUs: http://0x80.pl/articles/sse-popcount.html
There's no vector POPCNT in AVX prior to Ice Lake (AVX-512 VPOPCNTDQ), so using other techniques there is sensible. Also, that example is about POPCNT, not LZCNT.
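
For reference, the kind of "other technique" being discussed looks roughly like the classic SWAR popcount below. This is a scalar C sketch with a made-up function name (`popcount32_swar`), and it is not necessarily the variant the linked article benchmarks. The same shift/mask/add steps map onto vector registers when no vector POPCNT instruction is available.

```c
#include <stdint.h>

/* Hypothetical helper: SWAR population count of a 32-bit value. */
static inline uint32_t popcount32_swar(uint32_t x)
{
    /* Each 2-bit field now holds the count of its two bits (0..2). */
    x = x - ((x >> 1) & 0x55555555u);

    /* Sum adjacent 2-bit fields into 4-bit fields (0..4). */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);

    /* Sum adjacent 4-bit fields into bytes (0..8). */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;

    /* The multiply adds all four bytes into the top byte. */
    return (x * 0x01010101u) >> 24;
}
```

Whether this beats a hardware POPCNT depends entirely on the target, which is the point being argued above.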