by dzaima
462 days ago
It's not that trivial: the wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013), with ymm inputs, that can manage 1.5 iterations per cycle if you unroll 2x to reduce the dependency chain, whereas vpsadbw would limit it to 1 iteration per cycle. On AMD Zen≤2, vpsadbw is still worse, but on Zen≥3 the two approaches are equal. On AVX-512 the two approaches are equivalent everywhere, as far as uops.info data goes.
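For anyone who wants to see the `acc += 1 &~ elt` step spelled out, here's a rough scalar model in C. It is only a sketch: the names `count_wrapping`, `count_wide` and `LANES` are made up for illustration, the 32 lanes stand in for one ymm register, and whether a compiler actually turns the inner loop into vpandn + vpaddb depends on the compiler and flags. `count_wide` is just a widening reference to compare against, not a claim about how the vpsadbw version is written.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 32 }; /* assumption: one ymm register = 32 byte lanes */

/* Wrapping count: every 8-bit lane does acc += 1 & ~elt, so each lane
   wraps mod 256. The lanes are then summed, again mod 256. */
uint8_t count_wrapping(const uint8_t *elts, size_t n) {
    uint8_t acc[LANES] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] = (uint8_t)(acc[l] + (1u & ~(unsigned)elts[i + l]));

    uint8_t total = 0;
    for (size_t l = 0; l < LANES; l++)
        total = (uint8_t)(total + acc[l]);
    for (; i < n; i++) /* leftover elements that don't fill a full vector */
        total = (uint8_t)(total + (1u & ~(unsigned)elts[i]));
    return total;
}

/* Widening reference: same per-element test, 64-bit accumulator. */
uint64_t count_wide(const uint8_t *elts, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += 1u & ~(unsigned)elts[i];
    return total;
}
```

The wrapping result is always the wide result truncated to 8 bits, which is why the cheap per-lane byte add is enough when a wrapping count is all you need.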