by gpuhacker 1262 days ago
This is a great post for people who are new to optimizing GPU code.

It is interesting to see that the author got this far without interchanging the loops so that the innermost loop over k becomes the outermost one, as is done in CUTLASS (https://github.com/NVIDIA/cutlass).
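To make the loop interchange concrete, here is a minimal CPU-only sketch in plain Python. It is not CUDA and not taken from CUTLASS or the blog post; the function names are made up for illustration. The first version keeps k innermost (one dot product per output element), the second moves k outermost so that each k step adds one outer product to the whole output tile.

```python
def matmul_k_inner(A, B):
    """k innermost: each C[i][j] is computed as one full dot product."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C


def matmul_k_outer(A, B):
    """k outermost: each k step adds a rank-1 outer product to all of C."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):
        col = [A[i][k] for i in range(M)]  # column k of A, read once per k
        row = B[k]                         # row k of B, read once per k
        for i in range(M):
            a = col[i]
            for j in range(N):
                C[i][j] += a * row[j]
    return C
```

Both orderings give the same result. With k outermost, each loaded column of A and row of B is reused across the whole M x N tile, and that reuse is what makes this ordering attractive when the tile is held in registers.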

As you can see in this blog post, the code ends up with a lot of compile-time constants (e.g. BLOCKSIZE, BM, BN, BK, TM, TN). One way to optimize this code further is to use an auto-tuner, for example Kernel Tuner (https://github.com/KernelTuner/kernel_tuner), to find the best values of all these parameters for your GPU and problem size.
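For anyone who has not used an auto-tuner, the basic idea can be sketched as a brute-force search over the parameter grid. This is a rough illustration only and not Kernel Tuner's API: `tune`, `restriction`, and `toy_benchmark` are hypothetical names, the candidate values are made up, and the cost function is a stand-in for actually compiling and timing the kernel on your GPU.

```python
import itertools


def tune(benchmark, tune_params, restriction=lambda cfg: True):
    """Try every allowed parameter combination and return the fastest one."""
    names = list(tune_params)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(tune_params[n] for n in names)):
        cfg = dict(zip(names, values))
        if not restriction(cfg):
            continue  # skip configurations that could not be launched
        t = benchmark(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time


# Hypothetical search space; the values are assumptions, not from the post.
tune_params = {
    "BM": [32, 64, 128],
    "BN": [32, 64, 128],
    "BK": [8, 16],
    "TM": [4, 8],
    "TN": [4, 8],
}


def restriction(cfg):
    """Assumed constraints: thread tiles divide block tiles, at most 1024 threads."""
    if cfg["BM"] % cfg["TM"] or cfg["BN"] % cfg["TN"]:
        return False
    threads = (cfg["BM"] // cfg["TM"]) * (cfg["BN"] // cfg["TN"])
    return threads <= 1024


def toy_benchmark(cfg):
    """Stand-in cost model, NOT a real measurement; replace with kernel timing."""
    target = {"BM": 64, "BN": 64, "BK": 16, "TM": 8, "TN": 8}
    return sum(abs(cfg[name] - target[name]) for name in target)


best_cfg, best_time = tune(toy_benchmark, tune_params, restriction)
```

In practice the benchmark step would rebuild the kernel with the chosen constants and time it for your problem size. A real tuner also offers smarter search strategies than exhaustive enumeration, which matters once the grid gets large.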

1 comment

Kernel Tuner is great! Remember going to a tutorial at SC21. Would highly recommend the tutorials they used to get familiar as well (https://github.com/KernelTuner/kernel_tuner_tutorial)