|
|
|
|
|
by anonymoushn
1106 days ago
|
|
This is the simdjson bit_indexer[0] which takes in a mask representing the positions of structural characters ([]{}:,) and outputs a list of the positions of those characters. Basically it uses popcnt to know how many there are and uses tzcnt repeatedly to find the position of a single one. It's a bit questionable whether building this list is necessary, but that concern is just about whether the tzcnt and popcnt belong here or in some other code that processes the structural characters (e.g. to iterate over structural characters you could iterate over the raw masks until you find a nonzero one and then start tzcnt'ing there). If we want to keep the list of positions, there are other approaches for converting the vector mask to the list of positions. We could and the vector mask with {0,1,2,3,4...}, then widen by 4x to get int32 positions and add our current int32 position, then use or emulate vcompress and unconditionally write everything to the list, counting on updating the position by popcnt(mask) to keep the array dense. So we'd still need popcnt, but some implementations of this sort of thing end up computing this some other way[1] and don't literally use the popcnt instruction. This approach might be more reasonable if we only widen by 2x and produce a list of int16 positions per 64k input chunk then go back and widen that whole list to int32 later. [0]: https://github.com/simdjson/simdjson/blob/master/src/generic... [1]: https://github.com/lemire/despacer/blob/master/src/despacer.... |
|
BLSI and BLSR are better instructions for iterating over a bitset and finding the 1s (BLSI) and then removing the 1s (BLSR).
Furthermore, it only took 3 assembly instructions to implement before the BMI instruction set (see: https://www.chessprogramming.org/General_Setwise_Operations#... )
You're correct on popcnt, but this kind of iteration I normally see with BLSI / BLSR, not with TZCNT. I feel like TZCNT is slower and less elegant for this purpose.
---------
To go from a single binary bit selected (ex: 0b00001000) into its index (ie: 3 in this case) I guess TZCNT could be used though. But I would have done popcnt(x-1) instinctively.
TZCNT would be faster though, so there's that.
------
But all of these operations are only ~2 to 3 assembly instructions. Roughly on the same scale as absolute value. With exception of popcnt of course.
----
EDIT: Okay, I think I see how TZCNT comes into play now. Thanks! Just had to think it through.