by david-gpu 590 days ago
The parameters of a matrix multiply, such as the sizes of the matrices, limit how close you can get to the theoretical peak performance of a particular GPU. Not all possible matrix multiplies are equally valuable to optimize a priori, so the hardware is designed to perform best on financially significant workloads, such as modern LLMs.
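To make the size effect concrete, here is a rough sketch of one such limit, usually called tile and wave quantization. It assumes a deliberately simplified model: the output matrix is cut into fixed-size tiles, each SM works on one tile at a time, and partially filled tiles or waves cost as much as full ones. The function name, the 128x128 tile, and the count of 108 SMs are illustrative assumptions, not parameters taken from any particular library, and the model ignores the K dimension and memory bandwidth entirely.

```python
import math

def tile_utilization(m, n, tile_m=128, tile_n=128, num_sms=108):
    """Fraction of scheduled output work that is useful, under a toy model.

    Assumptions (illustrative only): one tile per SM per wave, and a
    partial tile or a partial wave costs the same as a full one.
    """
    # Number of output tiles needed to cover an m x n result.
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    # Tiles are processed in waves of at most num_sms tiles each.
    waves = math.ceil(tiles / num_sms)
    useful = m * n
    scheduled = waves * num_sms * tile_m * tile_n
    return useful / scheduled

if __name__ == "__main__":
    # 12 x 9 = 108 tiles: exactly one full wave in this model.
    print(tile_utilization(1536, 1152))
    # One extra column of tiles spills into a second, mostly empty wave.
    print(tile_utilization(1536, 1280))
```

Under these assumptions, growing one dimension by a single tile width drops the modeled utilization from 1.0 to about 0.56, even though the problem itself is only slightly larger.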

As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?

> As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?

Why not? We do so for cryptographic primitives and video codecs. And why are you talking about “thousands of kernels”? AI programs only need a small number of different kernels, so it doesn't sound intractable.

> AI programs only need a small number of different kernels

That is not the case. What appears to be a simple matmul operation actually requires these libraries to select which specific kernel to execute out of the many available internally.
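As a toy illustration of that selection step, the sketch below picks a kernel variant for a given problem size. Everything in it is hypothetical: the variant names, the tile shapes, and the cost model (minimize padded work, then prefer larger tiles) are assumptions made for the example, and they are not the heuristics that cuBLAS or CUTLASS actually use.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelVariant:
    name: str    # hypothetical label, not a real library kernel name
    tile_m: int  # rows of the output tile computed per thread block
    tile_n: int  # columns of the output tile computed per thread block

# Hypothetical menu of variants; real libraries ship far more than this.
VARIANTS = [
    KernelVariant("tile_256x128", 256, 128),
    KernelVariant("tile_128x128", 128, 128),
    KernelVariant("tile_64x64", 64, 64),
    KernelVariant("tile_32x32", 32, 32),
]

def padded_work(m, n, v):
    """Output elements computed once m and n are rounded up to whole tiles."""
    return (math.ceil(m / v.tile_m) * v.tile_m) * (math.ceil(n / v.tile_n) * v.tile_n)

def select_kernel(m, n, variants=VARIANTS):
    """Pick the variant with the least padded work; break ties toward
    larger tiles, which (by assumption) reuse loaded data better."""
    return min(
        variants,
        key=lambda v: (padded_work(m, n, v), -(v.tile_m * v.tile_n)),
    )

if __name__ == "__main__":
    for shape in [(4096, 4096), (128, 128), (96, 96)]:
        print(shape, select_kernel(*shape).name)
```

With this toy cost model, a 4096x4096 problem picks the largest tile while a 96x96 problem picks the smallest. A production library would also have to weigh the K dimension, data types, memory layouts, and the specific GPU, which is where the variant count grows.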

If you are curious to learn more, NVIDIA open-sourced a library called CUTLASS some years ago. And remember that this is only what they are willing to open-source.

Is that really different from AV codecs in terms of scale, though?

Yes. You can peek under the hood of cuBLAS and see that it has dozens of kernels for different problem sizes. When you encode H.264 at a different CRF, you generally don't have to implement a completely different tiling strategy.

I am not at liberty to discuss more than that.