| HN Mirror



	by richardwhiuk 4150 days ago
	Of course, the next step is obvious - work out why the compiler didn't do a four-way AVX unroll, and then submit a bug fix to Clang to make it do that. That way all of your future code benefits from your single micro-optimization. It's also possible you'll find out that if you enable --generate-for-haswell or some other arcane compiler flag, it'll do it for you.


All the author had to do was add '-march=native' or '-march=core-avx2' to the compiler command line: http://goo.gl/H4f62I
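To make that concrete, here is a minimal sketch of the kind of loop this applies to. It is not the code from the article: `sum_bytes` is a hypothetical stand-in, and whether a given compiler actually vectorizes it depends on its version and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not the article's code: a plain byte-summing loop.
 * Nothing here is written with intrinsics; any SIMD has to come from the
 * compiler, and which instruction set it may use is set by -march. */
uint64_t sum_bytes(const uint8_t *data, size_t n)
{
    /* A null buffer is only acceptable when there is nothing to read. */
    assert(data != NULL || n == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];   /* candidate for auto-vectorization */
    }
    return total;
}
```

Building it twice, once with `clang -O3 -S` and once with `clang -O3 -march=native -S`, and comparing the assembly for this function is a quick way to see what the flag changes on your machine. On an AVX2-capable CPU you would expect ymm registers to show up in the second build, but check the output rather than taking that on faith.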

Clang 3.7.0 (experimental) + -march=skylake gives you AVX512. zmm all the way, baby! 256 bytes processed in the inner loop!

But what gives me a Skylake CPU?

A time machine, or a job working at Intel? :)