| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by weinzierl 590 days ago
	75% can't be the best we can do. What would reach 100% or nearly 100%? Handcoded assembly?

2 comments

jsheard 590 days ago

With GPUs it's not uncommon to run out of memory bandwidth before you max out the theoretical FLOPS. They may have a ton of bandwidth but it's never enough.

That can lead you to some pretty counter-intuitive optimizations because it's often faster to do more compute work if it means you touch less memory in the process.

link

adev_ 590 days ago

> That can lead you to some pretty counter-intuitive optimizations because it's often faster to do more compute work if it means you touch less memory in the process.

It is not specific to the GPUs: this kind of optimizations are pretty common on CPU too where latency kills you and 200 cycles spent wasted on doing compute can actually be faster than a single cache miss trying to fetch data. This is pretty common for many SIMD algorithms actually.

Memory is currently lagging behind compute on almost every type of modern hardware, and it will very likely become worst, not better.

link

wbl 590 days ago

Shouldn't the roofline inform capacity assessments?

link

KeplerBoy 590 days ago

Sure, but rooflines don't account for stuff like memory granularity. You not only have to do a lot of bytes per flop to achieve the necessary arithmetic intensity, you also have to access those bytes in a coalesced way. I.e., you want to access consecutive bytes, which are ideally already in registers.

link

stephencanon 590 days ago

For sufficiently large GEMM you should never run out of bandwidth before you max out FLOPS if your blocking is organized correctly, because the arithmetic scales like O(n^3) while the memory access scales like O(n^2).

link

saagarjha 590 days ago

In theory, yes. In practice you will probably be forced to tile your GEMM and incur the penalty of redundant memory accesses.

link

stephencanon 590 days ago

Sure, but still on each tile, you do O(k^3) compute with O(k^2) memory, and you generally arrange things so that at least one tile is in L1 and at least one other is in L2/LLC (using CPU idioms), so again, you have plenty of bandwidth (typical choices of k are in the ballpark of ~32, and a 32:1 compute to memory ratio is just fine on most hardware, especially if some of those accesses are coming from fast memory)

link

saagarjha 589 days ago

I don't think so? It is too late for me to actually do the math on this but if you take the degenerate case where the tile size is literally 1 element then you will do as many loads as arithmetic operations. Thus I would consider any sort of fixed tiling (which you would be resigned to due to your caches being of limited size) requiring O(n^3) loads?

link

david-gpu 590 days ago

The parameters of the matrix multiply, such as the size of the matrices, impose some limits to how close you can get to the peak theoretical performance in a particular GPU. Not all possible matrix multiplies are equally valuable to optimize a priori, so the hardware is designed to perform best on problems that are financially significant, such as modern LLMs.

As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?

link

littlestymaar 590 days ago

> As for handcoded assembly, do you believe that it would be financially sound to hand code and maintain thousands of kernels that way, even if you believed that they would be faster?

Why not? We do so for cryptographic primitives and video codecs. And why are you talking about “thousands of kernels”? AI programs only need a small amount of different kernels so it doesn't sound intractable.

link

david-gpu 590 days ago

> AI programs only need a small amount of different kernels

That is not the case. What appears like a simple matmul operation actually requires these libraries to select which specific kernel out of the many internally available to execute.

If you are curious to learn more, NVidia open sourced a library called Cutlass some years ago. And remember that is only what they are willing to open source.

link

littlestymaar 590 days ago

Is that really different from AV codecs in terms of scale though?

link

saagarjha 590 days ago

Yes, you can peek under the hood of cuBLAS and notice that it has dozens of kernels for different problem sizes. It’s not generally the case that when you do h264 at a different crf you have a completely different tiling strategy that you have to implement.

link

david-gpu 590 days ago

I am not at liberty to discuss more than that.

link