Hacker News new | ask | show | jobs
by Const-me 269 days ago
The support for BMI1 instruction set extension is almost universal by now. The extension was introduced in AMD Jaguar and Intel Haswell, both launched in 2013 i.e. 12 years ago.

Instead of doing stuff like (word >> bit_offset) & self.mask, in C or C++ I usually write _bextr_u64, or when using modern C# Bmi1.X64.BitFieldExtract. Note however these intrinsics compile into 2 instructions not one, because the CPU instruction takes start/length arguments from lower/higher bytes of a single 16-bit number.

2 comments

BEXTR basically does the same thing, yes. I'm sticking with the portable shift-and-mask, though. My bet is that LLVM is smart enough to see that pattern and emit a BEXTR on its own when the target supports BMI1.

Using the intrinsic directly would also kill portability. I'd need #[cfg] for ARM and runtime checks for older x86 CPUs, which adds complexity for a tiny, if any, gain. The current code just lets the compiler pick the best instruction on any architecture.

I am handling a lot of vectors of size 6 or less with integers of values lower than 500.000. I am packing individual vectors in a u128, which comes down to 6*19 + a 3 bit length field = 117 bits with 11 bit left to spare.

I chose u128 instead of an array of [u64;2] due to the boundary crossing issue that I saw would happen, which I could avoid using u128.

I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?

> would you know if there is something similar that works on u128 instead of just u32/u64?

Not as far as I’m aware, but I think your use case is handled by the u64 version rather well. Instead of u128, use array of two uint64 integers, pack the length into unused high bits of one of them.

Here’s example C++ https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2, unpack is scalar code based on that BMI1 instruction.

Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh Might be faster if you have a lot of these packed vectors, extracting numbers in a tight loop, and s_extractElements lookup table remains in L1D cache.

P.S. I’ve tested that code just a couple of times, might be bugs

I don't think there's a single instruction to do this, but you could probably do it with a combination of shld + bzhi + cmov. rustc already seems to do a great job, and whatever I could come up with that assumes [src, src + len] is always in bounds isn't that much better.

Edit: https://godbolt.org/z/rrhW6T7Mc

Hvala puno. Very interesting! I will see how my implementation compares in asm.
Cool :) Feel free to reach out as well. I should note that that link is optimized for variable lengths and offsets -- if your lengths and offsets are constant then it can be much more efficient and I'd expect rustc/LLVM to nail it.