Hacker News new | ask | show | jobs
by saagarjha 427 days ago
This is a really good post, and a nice successor to the earlier one that I've pointed people at before: https://siboehm.com/articles/22/CUDA-MMM (this is linked in the post, of course). That one is a good introduction but I found that it doesn't really explain a lot of the choices it makes beyond the basic ones (coalescing=good, bank conflicts=bad). This is makes it a good introduction but this goes a lot further into the details that are important when you try to attack cuBLAS performance for faster kernels on more recent hardware.

In fact I recently had someone go through almost the same steps here, although they were using Ampere GPUs rather than Turing. As the post mentions, the microarchitecture is almost the same, especially when talking about the tensor cores (Hopper has significant changes). One difference is that it supports asynchronous global to shared memory transfers, which is quite useful. And you can get these for quite cheap: a RTX 3060 is a good card to test on. Vast.ai will rent one to you for a fraction of the price the author paid, and you can run Nsight Compute on them if you pick the VM instances (which costs several times more–the Docker instances are actually like a tenth of the prices the author paid).

I don't really have anything to point out about the content here but I will add that you will want to pipeline shared memory loads (to registers) in addition to those coming global memory. If you do this it's very possible to beat cuBLAS, which doesn't do that (CUTLASS does). I think around that microarchitecture you want at least a two-deep pipeline where global memory loads to shared memory two iterations head, then shared memory populates registers one iteration ahead, then you do the MMA tile. As you get faster and faster tensor cores this pipeline increases, and async copies will come help make your life easier–but the actual benefit here is that you save registers, which are a scarce resource. Async copies don't need to pin registers while waiting for global→shared stores. In fact Blackwell has dedicated "tensor memory" to directly load your tensor cores without needing to use registers at all. It's actually larger than your registers and shared memory put together, IIRC.

Most people have moved on at this point but I did want to say that I think the ldmatrix instruction was a mistake, honestly–it is designed to hide the matrix layout from you, but in doing so it also makes it inevitable that you'll have conflicts. One thing I haven't heard many people talk about is that conflicts actually occur in a transaction, rather than a warp. This also applies to coalescing from global memory. So for both if you use vectorized operations you can fill a transaction with fewer threads, reducing the chance of conflicts (or threads cooperating for coalescing). If all threads perform vectorized loads, you reduce your worst-case conflicts from 32-way to 8-way! It's an easy way to improve your performance if you're having trouble with laying out data across the entire warp.

Also, occupancy of GEMM is basically 2-3 blocks per SM, if that. This is mentioned by this post, and Simon does too. But I think those who are new to writing high-performance compute bound kernels don't really internalize it as they should. The model of "more threads, more occupancy, to hide the memory latency" is truly dead and buried. It is hard to feed tensor cores but you can't solve this by wavefront parallelism anymore. Everything is explicitly pipelined, with more and more of it being done in software in the direction of computation, rather than horizontally across warps or SMs.

1 comments

thanks so much for reading!!

>I think around that microarchitecture you want at least a two-deep pipeline where global memory loads to shared memory two iterations head

I agree, if I were to spend more time on this, I would add another layer of pipelining like you say, and also tweak it so that each threadblock is persistent on a particular SM, and processes multiple tiles. This allows you to hide the write latency of the epilogue, and arrange the SM<->data layout in way that maximizes L2 cache locality. (good explanation of this here https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...)

>I did want to say that I think the ldmatrix instruction was a mistake

I agree. I found this instruction wierd because it hides which thread is reading which data, but it causes shared memory bank conflicts, so you are left to guess which thread is reading what. I find using the TMA on hopper is much nicer.

>The model of "more threads, more occupancy, to hide the memory latency" is truly dead and buried

Well said, seems like all the intro to CUDA textbooks need to be rewritten. It seems like now for all GEMM-like kernels, occupancy matters very little, and its more about using dedicated, asychronous hardware units properly in conjunction one another. I like this because there is a bit less black magic involved when chasing the long tail of performance. This is well put here

https://research.colfax-intl.com/cutlass-tutorial-writing-ge...

"In historical context, these developments continue a trend of replacing general-purpose computational resources by specialized hardware resources, to both remove bottlenecks and free up those general-purpose resources for other operations. Starting with the Volta architecture, the Tensor Cores divorced GEMM arithmetic operations from the general computational pipeline. Ampere’s asynchronous copy instructions allowed for true pipelining of GEMM mainloops. On Hopper GPUs, the asynchronous, single-threaded TMA and the ability to reallocate registers between warpgroups dramatically reduced the register and thread cost of data movement, and the asynchronous WGMMA allowed for pipelining of MMA with other compute operations. Now, Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations."

Well, CUDA is used for a lot of things. It can make sense to increase occupancy when your code is memory bound, for example.