Hacker News new | ask | show | jobs
by bobbyi 1106 days ago
> There is no scalar instruction for calculating the absolute value, so it uses a negate and conditional move instead.

Wait, what? x86 has specialized instructions for all kinds of things but doesn't have absolute value?

1 comments

IIRC you can compose an xor and subtract to do abs, which might eliminate the need for a dedicated opcode. Although negate/cmov might work better. Either way you'd only need an abs if it showed up often enough that you wanted to save code size or could schedule the pipeline better, which seems unlikely.
I wonder how many people need anything in the BMI suite of assembly language, or BMI2.

https://en.m.wikipedia.org/wiki/X86_Bit_manipulation_instruc...

Popcnt, LZCNT and TZCNT are pretty nifty and all, but probably more obscure than absolute value.

I do think pdep and pext are brilliant and that more people should play with those two BMI2 instructions.

A dedicated absolute value instruction would save exactly one instruction every time you do absolute value, and both of the instructions for naive absolute value are already pretty fast. Dedicated L/TZCNT instructions save a dozen or so instructions every time you do that operation, and some 'clever' ways to do it might use fewer instructions but some are fairly slow instructions.

Amdahl's law. You don't gain a lot of benefit by improving something which is already fast.

Popcount is extremely common in algorithms. XOR then popcount gives hamming distance, for example. I’ve used lzcnt many times as well, in parsing binary formats and flag fields.
tzcnt and popcnt are for parsing strings! If you'd like to e.g. lex a json document then you'll need these.
Popcnt is probably the most broadly applicable BMI instruction with a myriad of applications in parallel programming. I'm not surprised to hear that it has applications for string parsing, but I admit that I can't see how right now. But I can believe it even without seeing an example.

TZCNT being used for string parsing... Sorry, can't see it at all or how it could be possible lol. Could you give an example?

(31 - LZCNT(x)) is binary 32-bit logarithm and likely has a number of mathematical applications.

This is the simdjson bit_indexer[0] which takes in a mask representing the positions of structural characters ([]{}:,) and outputs a list of the positions of those characters. Basically it uses popcnt to know how many there are and uses tzcnt repeatedly to find the position of a single one. It's a bit questionable whether building this list is necessary, but that concern is just about whether the tzcnt and popcnt belong here or in some other code that processes the structural characters (e.g. to iterate over structural characters you could iterate over the raw masks until you find a nonzero one and then start tzcnt'ing there).

If we want to keep the list of positions, there are other approaches for converting the vector mask to the list of positions. We could and the vector mask with {0,1,2,3,4...}, then widen by 4x to get int32 positions and add our current int32 position, then use or emulate vcompress and unconditionally write everything to the list, counting on updating the position by popcnt(mask) to keep the array dense. So we'd still need popcnt, but some implementations of this sort of thing end up computing this some other way[1] and don't literally use the popcnt instruction. This approach might be more reasonable if we only widen by 2x and produce a list of int16 positions per 64k input chunk then go back and widen that whole list to int32 later.

[0]: https://github.com/simdjson/simdjson/blob/master/src/generic...

[1]: https://github.com/lemire/despacer/blob/master/src/despacer....

> Basically it uses popcnt to know how many there are and uses tzcnt repeatedly to find the position of a single one.

BLSI and BLSR are better instructions for iterating over a bitset and finding the 1s (BLSI) and then removing the 1s (BLSR).

Furthermore, it only took 3 assembly instructions to implement before the BMI instruction set (see: https://www.chessprogramming.org/General_Setwise_Operations#... )

You're correct on popcnt, but this kind of iteration I normally see with BLSI / BLSR, not with TZCNT. I feel like TZCNT is slower and less elegant for this purpose.

---------

To go from a single binary bit selected (ex: 0b00001000) into its index (ie: 3 in this case) I guess TZCNT could be used though. But I would have done popcnt(x-1) instinctively.

TZCNT would be faster though, so there's that.

------

But all of these operations are only ~2 to 3 assembly instructions. Roughly on the same scale as absolute value. With exception of popcnt of course.

----

EDIT: Okay, I think I see how TZCNT comes into play now. Thanks! Just had to think it through.