Hacker News new | ask | show | jobs
by marshallward 815 days ago
I just did a test of OpenBLAS with Intel-compiled BLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s, so I must admit that I was wrong here. Maybe in some sense 4% is not bad, but it's certainly not good. My faith in current compilers has certainly been shattered quite a bit today.

Anyway, I have come to eat crow. Thank you for your insight and helping me to get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.

1 comments

The inequality between matrix multiplication implementations is enormous. It gets even more extreme on GPU where I've seen the difference between naïve and cuBLAS going as high as 1000x. Possibly 10000x. I have a lot of faith in myself as an optimization person to be able to beat compilers. I can even beat MKL and hipBLAS if I focus on specific shapes in sizes. But trying to beat cuBLAS at anything makes me feel like Saddam Hussein when they pulled him out of that bunker.
I'm sure there's more to it, but just comparing the profile output shows aggressive use of prefetch and broadcast instructions.
BLIS does that in their kernels. I've tried doing that but was never able to get something better than half as good as MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write output.