| HN Mirror



	by jart 810 days ago
	The inequality between matrix multiplication implementations is enormous. It gets even more extreme on GPU where I've seen the difference between naïve and cuBLAS going as high as 1000x. Possibly 10000x. I have a lot of faith in myself as an optimization person to be able to beat compilers. I can even beat MKL and hipBLAS if I focus on specific shapes and sizes. But trying to beat cuBLAS at anything makes me feel like Saddam Hussein when they pulled him out of that bunker.

1 comments

marshallward 810 days ago

I'm sure there's more to it, but just comparing the profile output shows aggressive use of prefetch and broadcast instructions.


jart 810 days ago

BLIS does that in their kernels. I've tried doing that but was never able to get something better than half as good as MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write output.
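
To illustrate why tiling across k needs synchronization on the output: when the k dimension is split between threads, each thread holds only a partial sum for every element of C, so several threads end up adding into the same output cells. A minimal sketch of that idea follows, assuming plain Python lists and one lock per block of output rows. It is not BLIS code; the name `matmul_split_k` and the parameters `n_k_tiles` and `row_block` are made up for this example, and it is meant to show the write pattern, not to be fast.

```python
import threading


def matmul_split_k(a, b, n_k_tiles=4, row_block=2):
    """Multiply a (m x k) by b (k x n), splitting the k dimension across threads.

    Hypothetical illustration only: assumes non-empty, rectangular list-of-list inputs.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]

    # The "array of locks": one lock per block of output rows.
    n_blocks = (m + row_block - 1) // row_block
    locks = [threading.Lock() for _ in range(n_blocks)]

    # Width of each k tile (at least 1 so range() gets a valid step).
    k_step = max(1, (k + n_k_tiles - 1) // n_k_tiles)

    def worker(k0, k1):
        for blk in range(n_blocks):
            r0 = blk * row_block
            r1 = min(r0 + row_block, m)
            # Compute this k tile's partial products privately, without a lock.
            partial = [
                [sum(a[i][p] * b[p][j] for p in range(k0, k1)) for j in range(n)]
                for i in range(r0, r1)
            ]
            # Every k tile contributes to the same output block, so the
            # accumulation into c has to be serialized per block.
            with locks[blk]:
                for di, row in enumerate(partial):
                    for j, value in enumerate(row):
                        c[r0 + di][j] += value

    threads = [
        threading.Thread(target=worker, args=(k0, min(k0 + k_step, k)))
        for k0 in range(0, k, k_step)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_split_k(a, b, n_k_tiles=3))
```

If the work were split across rows or columns of C instead, each thread would own a disjoint set of output elements and no locks or atomics would be needed for the writes.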
