Hacker News new | ask | show | jobs
by stephencanon 590 days ago
For sufficiently large GEMM you should never run out of bandwidth before you max out FLOPS if your blocking is organized correctly, because the arithmetic scales like O(n^3) while the memory access scales like O(n^2).
1 comments

In theory, yes. In practice you will probably be forced to tile your GEMM and incur the penalty of redundant memory accesses.
Sure, but still on each tile, you do O(k^3) compute with O(k^2) memory, and you generally arrange things so that at least one tile is in L1 and at least one other is in L2/LLC (using CPU idioms), so again, you have plenty of bandwidth (typical choices of k are in the ballpark of ~32, and a 32:1 compute to memory ratio is just fine on most hardware, especially if some of those accesses are coming from fast memory)
I don't think so? It is too late for me to actually do the math on this but if you take the degenerate case where the tile size is literally 1 element then you will do as many loads as arithmetic operations. Thus I would consider any sort of fixed tiling (which you would be resigned to due to your caches being of limited size) requiring O(n^3) loads?