UPDATE: For those who don't want to do the exercise: The Intel compiler (icc) optimizes it in about the way that I would if I were to write it in assembly:
Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.
If you disable SSE, it generates essentially what icc does, with less reliance on microcode.
See, rep is a workaround for the branch predictor on older AMD CPUs. In addition, gcc does not use the complex variant of lea with offsets because it is slow on older Intel CPUs.
The GCC version also makes for much better speculative execution because it uses the adder.
> Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.
Accepted (though the central advantage of SHLX (with -march=native) over SHL (no -march=native) for this example lies in the greater flexibility of register parameters). In my defense, I have only had a computer whose processor supports BMI2 for a few months, so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
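A minimal sketch of the register-flexibility point, using a hypothetical `shift_left` function rather than the code from the exercise. Without BMI2, a variable shift on x86-64 uses SHL, which requires the count in CL and overwrites its operand. With BMI2 enabled (for example via -mbmi2, or -march=native on a CPU that supports it), a compiler may emit SHLX instead, which takes the count from any general-purpose register, writes to a separate destination, and leaves the flags untouched. Whether a given compiler version actually picks SHLX here is an assumption worth checking in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical example for illustration only. */
uint64_t shift_left(uint64_t value, unsigned count)
{
    /* Mask the count so the shift is well defined in C for any input;
       the x86-64 shift instructions mask a 64-bit count to 6 bits anyway. */
    return value << (count & 63);
}
```

Compiling this once with and once without the BMI2 flag and comparing the output of -S is a quick way to see the difference.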
> See, rep is a workaround for the branch predictor on older AMD CPUs. In addition, gcc does not use the complex variant of lea with offsets because it is slow on older Intel CPUs.
I know that. My personal coding philosophy is to avoid such hacks, which only work around performance bugs in outdated processors.