Hacker News new | ask | show | jobs
by imtringued 424 days ago
I'm personally starting to get the impression that the CUDA programming model is actually terrible for things like matrix multiplication.

The author is constantly trying to work around the mismatch between the software model and the hardware model.

When doing matrix multiplication you don't want to program at the CUDA core level, you want to program at the streaming multiprocessor level.

There is no such thing as uncoalesced reads at the streaming multiprocessor level. It just doesn't exist.

mma.sync.aligned expects you to program at the streaming multiprocessor level. This is because the threads are sharing tensor cores, meaning that the concept of individual threads is really poorly thought out in this context.

I also have to comment on the Blogpost itself.

>Kernel 5 - Tune Tile Dimensions

This section is unlovingly named "tune tile dimensions", when you should explicitly say what you did: you allocated more memory towards accumulators. Matrix multiplication is accumulator bound.

This also plays into the misconception early in the post.

>To illustrate the best case scenario, imagine that fast memory was large enough to fit A,B and C in their entirety.

This is not correct, because you can split matrix multiplication along it's k or D dimension. You only need to keep the accumulator C in fast memory. If you can keep all of C in fast memory, then you only need to keep one vector of dimension n for A and one vector of dimension n for B in fast memory. Obviously in the case of tiles you want a vector of tiles. In the extreme limit the memory of A and B is irrelevant, because each element in the vector for A and B is read n times. As your matrix gets bigger, the sharing gets more extreme! O(n) memory for the inputs A and B but O(n^2) for C means that arithmetic intensity is inevitable!

The accumulator bottleneck is the essence of matrix multiplication!

1 comments

author here: thanks so much for the feedback. I agree, 'more memory for accumulators' would be a better title for this section.

and I also see the misconception you are pointing out in the 'best case' section. re-reading this, I realize that if you are accumulating C using outer products between columns of A and rows of B, you can achieve O(N) intensity while storing all of C, and just a column of A and a row of B in fast memory. Whereas if you are using inner products, you need all of A,B,C in fast memory to achieve O(N) intensity.

I guess when I wrote this I was just thinking about an inner product, which is too narrow. Thanks I might tweak this section :)