Hacker News new | ask | show | jobs
by aqrit 462 days ago
A optimized version would use 64-bit accumulators (`psadbw` on SSE2, or some sort of horizontal adds on NEON). The `255` max constraint is pointless.

Many programming languages/frameworks expose this operation as `reduce()`.

2 comments

It's not that trivial:

The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013) on ymm inputs that can manage 1.5 iterations per cycle, if unroll 2x to reduce the dependency chain.

Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.

On AMD Zen≤2, vpsadbw is still worse, but Zen≥3 manages to have the two approaches be equal.

On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.

Reduce does not accept a predicate.
It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:

  let arr = [ 1, 3, 4, 6,7, 0, 9, -4];
  let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize);
  assert_eq!(n_evens, 4);
Or more generally,

  fn count_if<T>(it: impl Iterator<Item=T>, pred: impl Fn(&T) -> bool) -> usize {
      it.fold(0, |acc, t| acc + pred(&t) as usize)
  }
I know that. But that’s still a different interface. If you have a predicate you now have to wrap that in a different closure that conforms it to a new pattern.

This is the same argument as why have count_if if I can write a for loop.

Sure. But at least I interpreted the GP as just saying that the "count-if" operation can be implemented in terms of `reduce` if the latter is available.