would totally love to read a modern `Hacker's Delight`. My mind was so blown away the first time I learned about low-level optimizations. I wish I did more of that on a day to day
The VSHRN trick is nice (I used it only two hours ago!), but it really does feel like a crutch; I don't understand why they couldn't simply implement a PMOVMSKB-like instruction to begin with (it cannot possibly be very expensive in silicon, at least not if it wrote its result to a vector register). One-bit-per-byte is really the sweet spot for almost any kind of text manipulation, and often requires less setup/post-fixup on either side of the PMOVMSKB/VSHRN.
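For anyone who hasn't seen the two side by side, here is a rough scalar model in C of what each one gives you, assuming the input is a byte-wise compare result (every byte 0x00 or 0xFF). The function names are made up for illustration; on real hardware you would reach for `_mm_movemask_epi8` on x86 and `vshrn_n_u16(..., 4)` on NEON, and this sketch only mimics their bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of PMOVMSKB: one bit per byte (the top bit). */
static uint16_t model_pmovmskb(const uint8_t v[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        mask = (uint16_t)(mask | (((v[i] >> 7) & 1u) << i));
    return mask;
}

/* Hypothetical scalar model of SHRN #4 on eight 16-bit lanes: each lane is
   shifted right by 4 and truncated to 8 bits, so every input byte ends up
   contributing one nibble of the 64-bit result. */
static uint64_t model_shrn4(const uint8_t v[16]) {
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t lane = (uint16_t)(v[2 * i] | (v[2 * i + 1] << 8)); /* little-endian lane */
        uint8_t narrowed = (uint8_t)(lane >> 4);
        mask |= (uint64_t)narrowed << (8 * i);
    }
    return mask;
}

/* Portable count-trailing-zeros; the caller must pass a non-zero value. */
static int ctz64(uint64_t x) {
    assert(x != 0);
    int n = 0;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

/* Index of the first matching byte, or -1 if there is none. */
static int first_match_movemask(const uint8_t cmp[16]) {
    uint16_t m = model_pmovmskb(cmp);
    return m ? ctz64(m) : -1;      /* bit index == byte index */
}

static int first_match_shrn(const uint8_t cmp[16]) {
    uint64_t m = model_shrn4(cmp);
    return m ? ctz64(m) >> 2 : -1; /* 4 bits per byte, so divide by 4 */
}
```

The extra `>> 2` is the kind of post-fixup I mean: with four bits per byte, every index, popcount, or "clear lowest match" step has to account for the nibble layout.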
> However, developers often encounter problems with Arm NEON instructions being expensive to move to scalar code and back.
I remember talking to an ARM engineer easily 10 years ago, and he told us in that nice British accent: "You know, NEON is like 'back in the yard'" :-D. This has changed a lot, but not enough, judging from what you wrote... It's a bit sad that these SIMD optimizations are still handwritten...