| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by stephencanon 590 days ago
	For sufficiently large GEMM you should never run out of bandwidth before you max out FLOPS if your blocking is organized correctly, because the arithmetic scales like O(n^3) while the memory access scales like O(n^2).

1 comments

saagarjha 590 days ago

In theory, yes. In practice you will probably be forced to tile your GEMM and incur the penalty of redundant memory accesses.

link

stephencanon 590 days ago

Sure, but still on each tile, you do O(k^3) compute with O(k^2) memory, and you generally arrange things so that at least one tile is in L1 and at least one other is in L2/LLC (using CPU idioms), so again, you have plenty of bandwidth (typical choices of k are in the ballpark of ~32, and a 32:1 compute to memory ratio is just fine on most hardware, especially if some of those accesses are coming from fast memory)

link

saagarjha 589 days ago

I don't think so? It is too late for me to actually do the math on this but if you take the degenerate case where the tile size is literally 1 element then you will do as many loads as arithmetic operations. Thus I would consider any sort of fixed tiling (which you would be resigned to due to your caches being of limited size) requiring O(n^3) loads?

link