Great article! It left me curious, though, about how an implementation written in pure assembly would have performed compared with the intrinsics version. Does anyone know, or is it safe to just assume “faster”?
I've generally found little to no difference between assembly and intrinsics. Once I start using intrinsics, I tend to look at the generated assembly to see what's actually being emitted and to make sure the compiler isn't doing something surprising.
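To make that workflow concrete, here is a minimal sketch of the kind of function I'd check. It is not from the article: `add_floats` is a made-up name, and the SSE path assumes an x86 target (there is a scalar fallback so it still builds elsewhere).

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical example: dst[i] = a[i] + b[i] for n floats. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop handles the tail (or everything when SSE is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Compiling with something like `gcc -O2 -S add.c -o add.s` (or running `objdump -d` on the object file) shows what the compiler actually emitted for the loop, which is usually enough to tell whether hand-written assembly would have anything left to gain.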