by crest 250 days ago
ARMv8-A has nice scalar bit (un)packing instructions. I wonder if NEON is really an improvement over those, given that ARM cores tend to have few SIMD ports and NEON is just 128 bits wide.

I'm assuming you're referring to BFM/EXTR? NEON absolutely improves here.

The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multicycle integer operations such as the insert variant of BFM (BFI), which is essential for scalar packing.

More importantly, NEON processes 16 elements per instruction (16 byte lanes in a 128-bit register) instead of 1.