Hacker News
by aifath 722 days ago
Matmul is cubic compute, but quadratic memory.

For [M, K] @ [K, N], the read is O(MK + NK) and the compute is O(MNK). A quick estimate for the compute/bandwidth ratio is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2.
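To make the min(M, N, K) estimate concrete, here is a rough sketch in Python. The function name `intensity` and the example shapes are mine, not from any library or benchmark, and I also count the [M, N] output write, which the estimate above leaves out. It counts multiply-accumulates per element moved, ignoring caches and tiling.

```python
def intensity(m: int, k: int, n: int) -> float:
    """Rough multiply-accumulates per element moved for [m, k] @ [k, n].

    Assumption: each input is read once and the output is written once.
    Real kernels tile and re-read, so treat this as an upper-bound sketch.
    """
    macs = m * n * k                    # cubic compute
    traffic = m * k + k * n + m * n     # quadratic memory (reads plus output write)
    return macs / traffic


# Hypothetical shapes: same weight matrix, small vs. large batch (M).
small_batch = intensity(1, 4096, 4096)
large_batch = intensity(4096, 4096, 4096)
print(f"M=1:    {small_batch:.3f} MACs per element")
print(f"M=4096: {large_batch:.1f} MACs per element")
```

Under these assumptions the ratio equals 1 / (1/M + 1/N + 1/K), so it always lands between min(M, N, K) / 3 and min(M, N, K). At M = 1 it is below 1 no matter how large N and K are.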

Groq needs a ton of SRAM because they optimized for batch-size-1 latency, so M is very small and the matmuls end up bandwidth-bound.