| HN Mirror


	by wolfgke 3309 days ago
	UPDATE: For those who don't want to do the exercise: the Intel compiler (icc) optimizes it roughly the way I would if I were writing it in assembly: https://godbolt.org/g/LntTnY


AstralStorm 3308 days ago

Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version. If you disable SSE, it generates essentially what icc does, with less reliance on microcode.

See, rep is a workaround for the branch predictor on older AMD CPUs. In addition, gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel CPUs.

The GCC version also makes for much better speculative execution, due to its use of the adder.

wolfgke 3308 days ago

> Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.

Accepted (though the central advantage of SHLX (with -march=native) over SHL (without -march=native) in this example lies in its greater flexibility in register operands). In my defense, I have only had a computer whose processor supports BMI2 for a few months, so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
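
A minimal sketch of the register-flexibility point, assuming a plain variable shift (the function `shift_left` is hypothetical and is not the code behind the godbolt link): the legacy SHL form takes a variable count only in CL, while BMI2's SHLX accepts the count in any general-purpose register and leaves the flags untouched. A compiler targeting a BMI2-capable CPU may therefore pick SHLX here, though the exact output depends on the compiler version and flags.

```c
#include <stdint.h>

/* Hypothetical example function, not taken from the original exercise. */
uint64_t shift_left(uint64_t value, unsigned count)
{
    /* Mask the count so the shift is well defined in C for any input.
       x86-64 shifts on 64-bit operands only use the low 6 bits of the
       count, so a compiler can usually drop this mask. */
    return value << (count & 63);
}
```

Comparing the assembly from something like `gcc -O2 -S` and `gcc -O2 -march=native -S` on a BMI2 machine should show whether the count has to be moved into CL first.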

> See, rep is a workaround for the branch predictor on older AMD CPUs. In addition, gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel CPUs.

I know that. My personal coding philosophy is to avoid such hacks for circumventing performance bugs in outdated processors.

AstralStorm 3308 days ago

By outdated you mean 5 years old? This is not for ancient Athlons. Those were 64-bit chips already.

Likewise, preferring microcoded lea shafts everything older than Haswell on the Intel side, not to mention modern Atom.
