by AstralStorm 3308 days ago
Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version. If you disable SSE, it generates essentially what icc does, with less reliance on microcode.

See, rep is a workaround for older AMD CPU branch predictor. In addition gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel.

The GCC version also makes for much better speculative execution, due to its use of the adder.

1 comments

> Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.

Accepted (though the central advantage of SHLX (with -march=native) over SHL (without -march=native) for this example lies in the greater flexibility of register parameters). In my defense, I have only had a computer whose processor supports BMI2 for a few months now, so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
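
To make the SHL/SHLX point concrete, here is a minimal sketch. It is not the code from the original post, and the function name `shift_left` is made up for illustration. For a variable-count shift like this, a compiler targeting baseline x86-64 typically has to move the count into CL before using SHL, whereas with BMI2 enabled (for example -march=native on a BMI2-capable machine, or -mbmi2) it may pick SHLX, which takes the count from any general-purpose register and does not touch the flags. The exact output depends on compiler version and flags, so check the generated assembly (e.g. with -O2 -S) rather than taking this on faith.

```c
#include <stdint.h>

/* Hypothetical example function, not the code discussed in the thread. */
uint64_t shift_left(uint64_t value, unsigned count)
{
    /* Mask the count so the shift is well defined in C for any input.
       64-bit x86 shifts only use the low 6 bits of the count anyway,
       so a compiler can usually drop this mask. */
    return value << (count & 63);
}
```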

> See, rep is a workaround for older AMD CPU branch predictor. In addition gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel.

I know that. My personal code philosophy is to avoid such hacks for circumventing performance bugs in outdated processors.

By outdated you mean 5 years old? This is not for ancient Athlons. Those were 64-bit chips already.

Likewise, preferring microcoded lea shafts everything older than Haswell on the Intel side, not to mention modern Atom.