
by mkeeter 590 days ago

For a very deep dive into the subject, this is a great writeup:

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)

(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)

2 comments

zanussbaum 590 days ago

this was a huge inspiration for the post! i tried to highlight it in the blog but it might have gotten buried

there are a few things that i couldn't figure out how to access, or wasn't sure were even possible. for example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.

i had a hard time finding information on whether that's even possible with my M2/Metal and the general memory access patterns. it seems like CUDA does have better documentation in this regard


almostgotcaught 590 days ago

That's a nice tutorial, but just to be clear: that is not a deep dive in any sense. It's just the bog-standard tricks. It doesn't cover MMA and WMMA, which today are table stakes for fast matmul. It also doesn't cover software pipelining. It's basically a good summary of the basics.


saagarjha 590 days ago

It’s a deep dive as of like 2015 probably. I don’t know if anyone has done something similar for modern GEMMs. Maybe the CUTLASS or Colfax people?
