Ideally, though, you'd want your compiler to emit the hardware population count instruction. So this might be one of the places where a simpler algorithm wins out, because the compiler can recognize it and replace it with that instruction.
In fact, that is one of the follow-ups I want out of this: architecture-specific optimisations (popcnt on x86), either written manually through something like the `asm` crate, or obtained by using a simpler algorithm that the compiler can optimise into the native instruction.
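
To make the "simpler algorithm" idea concrete, here is a minimal sketch in Rust. The function names `popcount_simple` and `popcount_builtin` are made up for illustration and are not from the code under discussion. The first is the classic clear-the-lowest-set-bit loop, which LLVM's loop idiom recognition can often turn into a population count; the second just calls the standard library's `count_ones`. Whether either one ends up as a single `popcnt` depends on the target: on x86-64 you would typically need something like `-C target-cpu=native` or `-C target-feature=+popcnt`, so check the generated assembly rather than assuming.

```rust
/// Hypothetical helper: counts set bits by clearing the lowest set bit
/// on each iteration. Simple enough that the optimiser may recognise it
/// as a population count, though that is not guaranteed.
pub fn popcount_simple(mut x: u64) -> u32 {
    let mut count = 0;
    while x != 0 {
        x &= x - 1; // clears the lowest set bit; x is non-zero here, so no underflow
        count += 1;
    }
    count
}

/// Hypothetical helper: defers to the standard library, which lowers to
/// the hardware instruction when the target supports it.
pub fn popcount_builtin(x: u64) -> u32 {
    x.count_ones()
}
```

Both functions should agree on every input, so the simple version can act as a reference while experimenting with what the compiler emits.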