by richardwhiuk 4104 days ago
Of course, the next step is obvious: work out why the compiler didn't do a four-way AVX unroll, then submit a fix to clang so that it does. That way all of your future code benefits from your single micro-optimization.

It's also possible that you find out that if you enable --generate-for-haswell or some other arcane compiler flag, it'll do it for you.


All the author had to do was add '-march=native' or '-march=core-avx2' to the compiler command line: http://goo.gl/H4f62I
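For anyone who wants to try this locally, here is a minimal sketch. The function `sum_bytes` and the file name `sum.c` are made up for illustration and are not the author's actual loop. The idea is to compile the same plain C loop with and without the -march flag and compare the generated assembly. Whether the loop gets vectorized, and how far it gets unrolled, depends on the compiler version and the target CPU.

```c
/* sum.c: hypothetical stand-in for the loop under discussion.
 *
 * Compare the assembly from these two builds:
 *   clang -O3 -S sum.c -o sum_baseline.s
 *   clang -O3 -march=native -S sum.c -o sum_native.s
 *
 * On an AVX2-capable machine the second build may use ymm registers
 * where the first uses xmm; the exact output varies by compiler version.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n bytes into a 64-bit total. Written as a plain scalar loop so that
 * any vectorization is left entirely to the compiler. */
uint64_t sum_bytes(const uint8_t *data, size_t n)
{
    assert(data != NULL || n == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Keep in mind that a binary built with -march=native is tuned for the build machine and may not run on older CPUs.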
Clang 3.7.0 (experimental) + -march=skylake gives you AVX512. zmm all the way, baby! 256 bytes processed in the inner loop!
But what gives me a Skylake CPU?
A time machine, or a job working at Intel? :)