by Const-me
267 days ago
> would you know if there is something similar that works on u128 instead of just u32/u64?

Not as far as I’m aware, but I think your use case is handled rather well by the u64 version. Instead of u128, use an array of two uint64 integers, and pack the length into the unused high bits of one of them.

Here’s an example in C++: https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2; the unpack is scalar code based on that BMI1 instruction.

Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh It might be faster if you have a lot of these packed vectors, are extracting numbers in a tight loop, and the s_extractElements lookup table remains in L1D cache.

P.S. I’ve tested that code just a couple of times, so there might be bugs.
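
For readers who don't want to open the links, here is a minimal portable sketch of the general idea: two uint64 words instead of a u128, with the length stored in unused high bits. It is not the code behind the godbolt links. The layout (12-bit elements, 5 per word, count in the top 4 bits of word 0) and the names `Packed`, `pack`, `packedLength` and `extract` are assumptions chosen purely for illustration, and it uses plain shift-and-mask rather than AVX2 or BMI1 intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout, not taken from the linked sources:
// each 64-bit word holds 5 elements of 12 bits (60 bits total), and the
// 4 unused high bits of word 0 store the element count (0..10).
constexpr unsigned kBits = 12;
constexpr unsigned kPerWord = 5;
constexpr unsigned kMaxLen = 2 * kPerWord;
constexpr unsigned kLenShift = kBits * kPerWord;   // bit 60
constexpr uint64_t kMask = (1ull << kBits) - 1;

using Packed = std::array<uint64_t, 2>;

// Pack up to kMaxLen values, each of which must fit in kBits bits.
inline Packed pack(const uint16_t* values, size_t count)
{
    assert(count <= kMaxLen);
    Packed p{0, 0};
    for (size_t i = 0; i < count; i++)
    {
        assert(values[i] <= kMask);
        const unsigned shift = static_cast<unsigned>(i % kPerWord) * kBits;
        p[i / kPerWord] |= static_cast<uint64_t>(values[i]) << shift;
    }
    // The count goes into the otherwise unused high bits of word 0.
    p[0] |= static_cast<uint64_t>(count) << kLenShift;
    return p;
}

// Number of elements stored in the packed value.
inline size_t packedLength(const Packed& p)
{
    return static_cast<size_t>(p[0] >> kLenShift);
}

// Extract element i with a shift and a mask. Elements never straddle
// the word boundary in this layout, so one word is enough.
inline uint16_t extract(const Packed& p, size_t i)
{
    assert(i < packedLength(p));
    const unsigned shift = static_cast<unsigned>(i % kPerWord) * kBits;
    return static_cast<uint16_t>((p[i / kPerWord] >> shift) & kMask);
}
```

Keeping every element inside a single word is what makes the unpack cheap: it is one load, one shift and one mask, which is the kind of pattern a bit-field-extract instruction can cover. Whether that beats a lookup-table variant depends on the workload, so measure before committing to either.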