| Yeah pretty much this. I would separate the knowledge into maybe 3 distinct buckets. The baseline: device/host boundary, SIMT programming etc. The intermediate: kernel architecture, CUDA graph vs persistent kernels, warp specialisation/divergence avoidance techniques etc. The advanced: architecture specifics so tcgen05, TMA, SMEM/HBM, memory throughput vs compute biases in various arch impls., GEMM, FHMA, all the tricks that make modern fused kernels very fast. Also would bucket most GPU Direct RDMA/GPU NetIO/friends here too. The baseline hasn't changed much and probably won't, the intermediate knowledge has also remained pretty reliably stable for ~10 years with only things like graphs changing stuff. Tile might become more relevant than it is today but for now CUDA, cuBLAS, friends are where it's worth investing knowledge. |
This would require very different (re-written?) kernels than a few years back, wouldn't it?
Would you have any good resources on the topic?