by dzaima 462 days ago
It's not that trivial:

The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013), with ymm inputs, that can manage 1.5 iterations per cycle if you unroll 2x to shorten the dependency chain.

Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.
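
For reference, a scalar C model of what the two loop bodies compute per ymm register. This is a sketch rather than real intrinsics code: the names `count_wrapping`, `count_widening` and `LANES` are made up for illustration, and it assumes the vpsadbw variant sums each group of 8 bytes against zero into a 64-bit lane on every iteration. It shows the semantics only and says nothing about throughput.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 32 };  /* byte lanes in one ymm register */

/* Wrapping variant: every byte lane does acc += 1 & ~elt
   (models vpandn + vpaddb). The result is the count modulo 256. */
uint8_t count_wrapping(const uint8_t *data, size_t n)
{
    uint8_t acc[LANES] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] = (uint8_t)(acc[l] + (1u & ~data[i + l]));

    uint8_t total = 0;
    for (size_t l = 0; l < LANES; l++)      /* horizontal wrapping sum */
        total = (uint8_t)(total + acc[l]);
    for (; i < n; i++)                      /* scalar tail */
        total = (uint8_t)(total + (1u & ~data[i]));
    return total;
}

/* Widening variant (assumed shape): each group of 8 bytes is summed
   into one of four 64-bit lanes per iteration (models vpandn,
   vpsadbw against zero, then vpaddq). No wraparound in practice. */
uint64_t count_widening(const uint8_t *data, size_t n)
{
    uint64_t acc[LANES / 8] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t q = 0; q < LANES / 8; q++) {
            uint64_t s = 0;
            for (size_t b = 0; b < 8; b++)
                s += 1u & ~data[i + q * 8 + b];
            acc[q] += s;
        }

    uint64_t total = 0;
    for (size_t q = 0; q < LANES / 8; q++)  /* horizontal sum */
        total += acc[q];
    for (; i < n; i++)                      /* scalar tail */
        total += 1u & ~data[i];
    return total;
}
```

Both functions count bytes whose low bit is clear; the only semantic difference is that the first wraps at 256, which is what lets it skip the per-iteration widening step.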

On AMD Zen≤2, vpsadbw is still worse, but on Zen≥3 the two approaches come out equal.

On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.