| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by psykotic 1836 days ago

As a counterexample to your claim about "every other ISA", ARM64 doesn't have an x86-style popcount instruction. Its CNT instruction operates on vectors, not words, and gives you the popcount within each byte. You need to perform a bytewise horizontal sum if you want to combine the counts across lanes.

If you start with a word in a GPR and want to compute its popcount into another GPR, you need to move the GPR into a vector register, do the CNT, do the horizontal sum, and move the result back into a GPR.

It's been a while since I did the test, but I think the latency I measured on the Apple M1 perf cores (Firestorm) for that instruction sequence was around 15 cycles. [1] Compare that to POPCNT being 3 cycles on Intel and it really hurts for latency-sensitive uses of word popcount. However, CNT is great for doing large reductions since you can do 32 rounds of CNT + vector add before the bytewise counters can overflow, so you can amortize away the cost of the horizontal reduction. (People end up doing similar things on x86 for large streaming popcount calculations.)

A lot of my micro-optimized code relies on low-latency word popcount, so I selfishly want every processor to have an Intel-style popcount. :)

[1] You can actually beat the latency of that very easily with an L1-resident byte popcount lookup table + sum reduction. This table is only 256B, so it's very practical. You can shave off 1-2 cycles of latency by going to a halfword popcount lookup table, which is 64KB. That kind of cache hogging will perform better in microbenchmarks than in real applications but with the M1's large L1 cache of 128KB per core it might be a net win if you have a tight loop that's heavily affected by popcount latency.