by torginus
719 days ago
Honestly, all this math sounds a bit fishy to me. An H200 has about 5 TB/s of memory bandwidth. If we assume a pure matrix multiply workload, we need to fetch 2 FP16 values (4 bytes) per multiply, which means we are capped at 1.25 TFLOPs. Even in the best-case scenario, where one of the operands is cached and the other is an FP8, we are only at 5 TFLOPs, which is way less than what the H200 can do. I don't get how throwing more ALUs at the problem would make things better; it's very much bandwidth constrained. That's why Groq exists, which has a ton of SRAM on chip.
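For anyone who wants to check the arithmetic above, here is a rough sketch. It assumes the round 5 TB/s figure from the comment and that every multiply has to fetch its operands from memory with no reuse; the function name is made up for illustration.

```python
def bandwidth_bound_ops_per_s(bandwidth_bytes_per_s, bytes_fetched_per_op):
    """Upper bound on operations per second if every operation must
    fetch `bytes_fetched_per_op` bytes from memory (no operand reuse)."""
    return bandwidth_bytes_per_s / bytes_fetched_per_op

BANDWIDTH = 5e12  # ~5 TB/s, the round figure used in the comment (assumption)

# Two FP16 operands per multiply: 2 values * 2 bytes = 4 bytes fetched.
fp16_both_fetched = bandwidth_bound_ops_per_s(BANDWIDTH, 2 * 2)

# Best case from the comment: one operand cached, the other a 1-byte FP8.
fp8_one_fetched = bandwidth_bound_ops_per_s(BANDWIDTH, 1)

print(f"FP16, both operands fetched: {fp16_both_fetched / 1e12:.2f} T ops/s")
print(f"FP8, one operand fetched:    {fp8_one_fetched / 1e12:.2f} T ops/s")
```

This only models the no-reuse case; it says nothing about how much each fetched value can be reused once it is on chip.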
For [M, K] @ [K, N], the read is O(MK + NK) and the compute is O(MNK). A quick estimate for compute/bandwidth is min(M, N, K). M is the batch size, so they can just blow that up to get nice-looking numbers. On Llama 70B, min(N, K) is 3584 and 7168 for matmuls 1 and 2.
Groq needs a ton of SRAM because they optimized for batch size 1 latency, so M is very small.
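A quick sketch of that estimate, in case it helps. It counts multiply-adds per element read using the O(MNK) / O(MK + NK) expression above and ignores output writes and caching effects. The shapes are illustrative placeholders, not the real Llama 70B layer shapes, and the function name is made up.

```python
def multiply_adds_per_element_read(m, n, k):
    """Compute-to-read ratio for [M, K] @ [K, N]:
    M*N*K multiply-adds over M*K + N*K elements read."""
    return (m * n * k) / (m * k + n * k)

# Hypothetical weight shape; only M (the batch dimension) is varied.
N, K = 4096, 4096

for m in (1, 8, 64, 512, 4096):
    ratio = multiply_adds_per_element_read(m, N, K)
    print(f"M={m:5d}  multiply-adds per element read = {ratio:8.1f}  "
          f"min(M, N, K) = {min(m, N, K)}")
```

At M = 1 each weight is used once per read, so the workload is bandwidth bound. As M grows, each weight read is amortized over the whole batch and the ratio tracks min(M, N, K) to within a factor of two, which is why a large batch makes the throughput numbers look good.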