by aifath
722 days ago
Matmul is cubic in compute but quadratic in memory. For [M, K] @ [K, N], reads are O(MK + NK) and compute is O(MNK).
A quick estimate of the compute-to-bandwidth ratio is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2. Groq needs a ton of SRAM because they optimized for batch-size-1 latency, so M is very small: the ratio collapses to roughly M and the matmul is limited by how fast the weights can be read.
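
To put rough numbers on that estimate, here is a minimal sketch. It assumes one multiply-add per (m, n, k) triple and that each matrix element is moved exactly once, and it counts the [M, N] output write as well as the two reads, which is my addition to the read-only count above. The function name and the batch sizes are made up for illustration, and which of 3584 and 7168 plays K versus N is also just an assumption.

```python
def matmul_intensity(m: int, k: int, n: int) -> float:
    """Multiply-adds per element moved for [m, k] @ [k, n].

    Assumes every element of both inputs and the output crosses the
    memory bus exactly once (no re-reads, perfect on-chip reuse).
    """
    multiply_adds = m * n * k
    elements_moved = m * k + k * n + m * n  # read A, read B, write C
    return multiply_adds / elements_moved


if __name__ == "__main__":
    k, n = 7168, 3584  # illustrative weight shape, assumed orientation
    for m in (1, 8, 64, 512, 4096):  # hypothetical batch sizes
        ratio = matmul_intensity(m, k, n)
        print(f"M={m:5d}  ratio={ratio:8.1f}  min(M, N, K)={min(m, k, n)}")
```

Under these assumptions the ratio equals 1 / (1/M + 1/N + 1/K), so it always sits between min(M, N, K) / 3 and min(M, N, K). At M = 1 it is just under 1, which is the batch-size-1 case where memory speed dominates.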