by somethingsome 36 days ago
Having read or at least skimmed most of those books, I think the best intro is 'CUDA Programming: A Developer's Guide to Parallel Computing with GPUs'

Programming Massively Parallel Processors: A Hands-on Approach is not really good in my opinion: it has many small mistakes and confusing sentences (even when you know CUDA).

CUDA by Example: An Introduction to General-Purpose GPU Programming is too simple and abstracts away too much of the architecture.

Next year I'm planning to start writing a CUDA book that starts by engineering the hardware, and goes up to the optimization part on that hardware (which is basically an NVIDIA card), including all the main algorithms (except for graphs).

I'm already teaching the course in this way at uni, and it is quite successful among students.

6 comments

Interesting, thanks for sharing.

What makes CUDA Programming: A Developer's Guide to Parallel Computing with GPUs better than its peers?

That's hardly a guide. It's the de facto documentation; you have to read this either way.

Very valuable comment. Thank you.

I always appreciate book lists like this one, but having a small targeted list is more practical for those of us with limited reading time.

I really wish there were better alternatives to PMPP... It's by far the most up-to-date book, but I totally agree the writing is sort of bad and some of the code examples are straight up incorrect.

So tl;dr, you have at least one person who would pay for a better book :-)

Thank you, that is very useful advice!

The first book was published in 2012. Is it too outdated?

Not really. The hardware hasn't changed that much. Of course you won't find Tensor or ray-tracing cores in it, but you will get a very solid grasp of GPU programming and the CUDA language (which hasn't changed that much either), and then you can easily pick up the more modern things from blog posts or even, at worst, ChatGPT.

Yeah, pretty much this.

I would separate the knowledge into maybe 3 distinct buckets.

The baseline: device/host boundary, SIMT programming etc.

The intermediate: kernel architecture, CUDA graphs vs. persistent kernels, warp specialisation/divergence avoidance techniques etc.

The advanced: architecture specifics, so tcgen05, TMA, SMEM/HBM, memory throughput vs compute biases in various arch impls., GEMM, FMHA, all the tricks that make modern fused kernels very fast. I would also bucket most GPU Direct RDMA/GPU NetIO/friends here too.

The baseline hasn't changed much and probably won't. The intermediate knowledge has also remained pretty reliably stable for ~10 years, with only things like graphs changing stuff. Tile might become more relevant than it is today, but for now CUDA, cuBLAS and friends are where it's worth investing knowledge.
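For anyone who hasn't met the baseline bucket yet, here is a rough CPU-side sketch of the SIMT idea in plain C++ (not CUDA, so it runs without a GPU). Every logical thread runs the same function and only its indices differ. The names `Dim1`, `saxpy_thread` and `launch_saxpy` are made up for illustration; in real CUDA the indices come from the built-ins `blockIdx`, `blockDim` and `threadIdx`, the two launch loops run in parallel on the device, and the host would also have to copy `x` and `y` across the host/device boundary, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D stand-in for CUDA's dim3 index types.
struct Dim1 { std::size_t x; };

// The "kernel": one call per logical thread, same body for all of them.
void saxpy_thread(Dim1 block_idx, Dim1 block_dim, Dim1 thread_idx,
                  float a, const float* x, float* y, std::size_t n) {
    // Global index, computed the way a 1-D CUDA kernel usually does it.
    std::size_t i = block_idx.x * block_dim.x + thread_idx.x;
    // Guard: the last block may contain threads past the end of the data.
    if (i < n) y[i] = a * x[i] + y[i];
}

// The "launch": the host picks the grid shape, then every (block, thread)
// pair runs the kernel body. On a GPU these loops execute in parallel.
void launch_saxpy(float a, const std::vector<float>& x,
                  std::vector<float>& y, std::size_t threads_per_block) {
    assert(x.size() == y.size() && threads_per_block > 0);
    const std::size_t n = x.size();
    // Ceiling division so that every element is covered by some thread.
    const std::size_t blocks = (n + threads_per_block - 1) / threads_per_block;
    for (std::size_t b = 0; b < blocks; ++b)
        for (std::size_t t = 0; t < threads_per_block; ++t)
            saxpy_thread(Dim1{b}, Dim1{threads_per_block}, Dim1{t},
                         a, x.data(), y.data(), n);
}
```

With 10 elements and 4 threads per block, the launch uses 3 blocks and the bounds check masks off the 2 surplus threads in the last one.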

> [..] all the tricks that make modern fused kernels very fast

This would require very different (re-written?) kernels than a few years back, wouldn't it?

Would you have any good resources on the topic?

Actually, little has changed fundamentally enough to matter, other than _perhaps_ the async global-to-shared-memory copy (a DMA-style memcpy), which avoids tying up register file space as target buffers for in-flight reads from global memory. Shared memory is, after all, just a partition of L1d$ since (iirc) Volta, when they started offering expanded shared capacity that is not fixed but requested at launch. So it made sense to provide this more-than-a-hint "prefetch into this user-managed slice of what is otherwise L1d$". AFAIK it's basically just some special load-like units that ask special L1d$-miss-fill units to deliver to a now explicitly specified target location in the non-automatic-cache partition of the local SRAM, and then signal completion in otherwise fairly normal local semaphore/barrier fashion.

The major difference is that this doesn't give you a natural moment to transform/touch the values after they are read from global and before they are stored to shared.

Otherwise, tiled MMA (gemm) kernels were normal even in Maxwell days (after the classic K80, before the P100; Maxwell is when H.265 support landed).
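To make "tiled gemm" concrete for readers who haven't written one, here is a minimal CPU-side sketch in plain C++ (again not CUDA, so it runs anywhere). The small arrays `As` and `Bs` stand in for the shared-memory staging tiles, the two outer loops stand in for the grid of thread blocks, and `TILE` and `tiled_matmul` are names made up for this example. It assumes square row-major matrices whose size is a multiple of the tile edge. Real kernels add synchronisation between the load and compute phases, plus register blocking and tensor-core instructions, none of which is shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; on a GPU this would match the thread-block edge.
constexpr std::size_t TILE = 4;

// C = A * B for square row-major n x n matrices, n a multiple of TILE.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t n) {
    assert(n % TILE == 0 && A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    // One iteration of these two loops corresponds to one thread block,
    // which owns one TILE x TILE tile of the output.
    for (std::size_t by = 0; by < n; by += TILE) {
        for (std::size_t bx = 0; bx < n; bx += TILE) {
            // March along the K dimension one tile at a time.
            for (std::size_t k0 = 0; k0 < n; k0 += TILE) {
                // Stand-ins for the shared-memory staging buffers.
                float As[TILE][TILE];
                float Bs[TILE][TILE];
                // Load phase: each "thread" (ty, tx) copies one element
                // of A and one element of B into the staging tiles.
                for (std::size_t ty = 0; ty < TILE; ++ty)
                    for (std::size_t tx = 0; tx < TILE; ++tx) {
                        As[ty][tx] = A[(by + ty) * n + (k0 + tx)];
                        Bs[ty][tx] = B[(k0 + ty) * n + (bx + tx)];
                    }
                // Compute phase: each "thread" accumulates a partial dot
                // product for its own output element from the tiles only.
                for (std::size_t ty = 0; ty < TILE; ++ty)
                    for (std::size_t tx = 0; tx < TILE; ++tx) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < TILE; ++k)
                            acc += As[ty][k] * Bs[k][tx];
                        C[(by + ty) * n + (bx + tx)] += acc;
                    }
            }
        }
    }
    return C;
}
```

The point of the staging tiles is reuse: each element loaded from the big matrices is used TILE times out of the small buffer, which is the same reason the GPU version stages through shared memory instead of re-reading global memory.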

I wish there were any good literature on GPU Direct RDMA and GPU NetIO. Got any tips?

So I would say the most important thing is that the APIs these are using, whether mlx5 DevX (essentially direct fw access) or ibverbs, are exactly the same regardless of whether it's a CPU or a GPU talking to it. With that in mind, the source of rdma-core, DPDK, ucx etc. may be the most elucidating when it comes to low-level details.

For higher-level patterns, again, the APIs are the same, so the high-level ideas in anything built on libibverbs or the aforementioned ucx etc. carry over pretty well. If you are new to RDMA in general and really want to build a good intuition, definitely start with raw verbs instead of abstractions like MPI, and then move to MPI once you understand what it is doing for you.

Thanks, I appreciate the answer!