| HN Mirror



	by saagarjha 590 days ago
	In theory, yes. In practice you will probably be forced to tile your GEMM and incur the penalty of redundant memory accesses.

1 comments

stephencanon 590 days ago

Sure, but on each tile you still do O(k^3) compute with O(k^2) memory accesses, and you generally arrange things so that at least one tile is in L1 and at least one other is in L2/LLC (using CPU idioms). So again, you have plenty of bandwidth: typical choices of k are in the ballpark of 32, and a 32:1 compute-to-memory ratio is just fine on most hardware, especially if some of those accesses are coming from fast memory.
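
A rough sketch of the per-tile accounting described above, not a tuned kernel. It assumes square n x n matrices with k dividing n, counts only the loads of the A and B tiles (traffic for C is ignored), and counts each multiply-add as 2 operations. The function name `tiled_gemm_counts` and the test matrices are made up for illustration.

```python
def tiled_gemm_counts(n, k):
    """Multiply two n x n matrices using k x k tiles.

    Returns (c, loads, flops) under a simplified cost model:
    each tile step loads one k x k tile of A and one of B.
    """
    assert n % k == 0, "this sketch assumes k divides n"
    # Hypothetical inputs with small integer values so results are exact.
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    flops = 0
    for i0 in range(0, n, k):
        for j0 in range(0, n, k):
            for p0 in range(0, n, k):
                # Model: bring one tile of A and one tile of B into fast memory.
                loads += 2 * k * k
                # Then do O(k^3) work on data that is already resident.
                for i in range(i0, i0 + k):
                    for j in range(j0, j0 + k):
                        acc = c[i][j]
                        for p in range(p0, p0 + k):
                            acc += a[i][p] * b[p][j]
                            flops += 2  # one multiply, one add
                        c[i][j] = acc
    return c, loads, flops


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        _, loads, flops = tiled_gemm_counts(8, k)
        print(f"k={k}: loads={loads} flops={flops} ratio={flops / loads}")
```

Under this model the compute-to-load ratio comes out equal to the tile size k, which is where a figure like 32:1 for k around 32 would come from.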


saagarjha 589 days ago

I don't think so? It is too late for me to actually do the math on this, but if you take the degenerate case where the tile size is literally 1 element, then you will do as many loads as arithmetic operations. So I would expect any sort of fixed tiling (which you would be resigned to, since your caches are of limited size) to require O(n^3) loads?
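
For reference, the arithmetic here can be written out under a simplified model. The assumptions are: square n x n matrices, a tile size k that divides n, only loads of the A and B tiles counted, and each multiply-add counted as 2 operations.

```latex
\begin{align*}
\text{flops} &= 2n^3 \\
\text{loads} &= \left(\frac{n}{k}\right)^{3} \cdot 2k^{2} = \frac{2n^{3}}{k} \\
\frac{\text{flops}}{\text{loads}} &= k
\end{align*}
```

With k = 1 the loads equal the arithmetic operations, matching the degenerate case. For any fixed k the load count is still O(n^3), but with a constant factor of 1/k, so the ratio of compute to loads is k.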
