by wolfgke 3309 days ago
It does not work very well:

> https://godbolt.org/g/qmQhLK

Exercise: Write an assembly function that is faster than "-O3" and loop-free. Should be easy.


UPDATE: For those who don't want to do the exercise: The Intel compiler (icc) optimizes it in about the way that I would if I were to write it in assembly:

> https://godbolt.org/g/LntTnY

Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version. If you disable SSE, it generates essentially what icc does, with less reliance on microcode.

See, rep is a workaround for the branch predictor of older AMD CPUs. In addition, gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel CPUs.

The GCC version makes for much better speculative execution too, due to the use of the adder.

> Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.

Accepted (though the central advantage of SHLX (with -march=native) over SHL (no -march=native) for this example lies in the greater flexibility of register parameters). In my defense, I have only had a computer whose processor supports BMI2 for a few months now, so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
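For readers without the godbolt output at hand, here is a minimal sketch of the register-flexibility point. The function name `shift_left` is made up for illustration and is not the code behind the links above. On x86-64, the legacy SHL instruction takes a variable count only in CL, while the BMI2 instruction SHLX accepts the count in any general-purpose register and does not modify the flags. Compilers targeting BMI2 (for example via -mbmi2 or a -march setting that includes it) will generally pick SHLX for code like this, though the exact output depends on compiler and version.

```c
#include <stdint.h>

/* Hypothetical helper: a left shift by a run-time count.
 * Without BMI2 the compiler has to move the count into CL before
 * issuing SHL; with BMI2 it is free to use SHLX with the count in
 * whatever register it already lives in. */
uint64_t shift_left(uint64_t value, unsigned count)
{
    /* Mask the count so the shift is well defined in C for any input.
     * The hardware masks a 64-bit shift count to 6 bits in the same way,
     * so this mask normally costs no extra instruction. */
    return value << (count & 63u);
}
```

Comparing the assembly for this function with and without -mbmi2 is a quick way to see the difference in register constraints.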

> See, rep is a workaround for the branch predictor of older AMD CPUs. In addition, gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel CPUs.

I know that. My personal code philosophy is to avoid such hacks for circumventing performance bugs in outdated processors.

By outdated, you mean 5 years old? This is not for ancient Athlons. Those were 64-bit chips already.

Likewise, preferring the microcoded lea shafts everything older than Haswell on the Intel side. Not to mention modern Atom.

Yes, but the point was that compilers do change algorithms.
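A small illustration of what "changing the algorithm" can look like, as a sketch only. This is not the code from the godbolt links in this thread, and both function names are made up. Some optimizers (Clang is known for this) can replace a counting loop like `sum_loop` with a loop-free closed form similar to `sum_closed_form`. Whether a given compiler and version actually does so has to be checked in its output.

```c
#include <stdint.h>

/* Hypothetical example: sum of 0 .. n-1 written as a plain loop. */
uint64_t sum_loop(uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; ++i)
        total += i;
    return total;
}

/* Loop-free equivalent, n * (n - 1) / 2.
 * The even factor is halved before multiplying, so the result matches
 * the loop modulo 2^64 even when n * (n - 1) would overflow.
 * For n == 0 the wrapped value of n - 1 is multiplied by 0. */
uint64_t sum_closed_form(uint64_t n)
{
    return (n % 2 == 0) ? (n / 2) * (n - 1)
                        : n * ((n - 1) / 2);
}
```

Both functions return the same value for every input. The second is the kind of rewrite the exercise above asks for by hand.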