by rfoo
475 days ago
> Still looks compute-bound to me.

H100 has 3.3TB/s of HBM bandwidth on paper and ~1000 TFLOPS of bf16 compute on paper. That's 1:300 (bytes to FLOPs). 0.6GB vs. ~2 GFLOPs is 1:3, so the workload does far less compute per byte than the hardware can keep up with. Tell me, how is this compute-bound? (Also, your number, even after accounting for GQA, is still off. You usually can't store the KV cache in fp8.)
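To make the ratio argument concrete, here is a minimal roofline-style sketch in Python. It uses only the paper figures quoted above (3.3TB/s, ~1000 TFLOPS, 0.6GB, ~2 GFLOPs). The function names are hypothetical, chosen for illustration, and a real kernel will reach neither peak.

```python
# Paper numbers quoted in the comment (assumed peaks, not measurements).
HBM_BYTES_PER_S = 3.3e12   # ~3.3 TB/s HBM bandwidth
PEAK_FLOPS = 1000e12       # ~1000 TFLOPS bf16

# Workload numbers quoted in the comment.
BYTES_MOVED = 0.6e9        # ~0.6 GB read from HBM
FLOPS_DONE = 2e9           # ~2 GFLOPs of compute


def machine_balance(peak_flops: float, bytes_per_s: float) -> float:
    """FLOPs the hardware can execute per byte it can load."""
    return peak_flops / bytes_per_s


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs the workload performs per byte it moves."""
    return flops / bytes_moved


def classify(flops: float, bytes_moved: float,
             peak_flops: float, bytes_per_s: float) -> str:
    """Compare workload intensity against machine balance (simple roofline test)."""
    intensity = arithmetic_intensity(flops, bytes_moved)
    balance = machine_balance(peak_flops, bytes_per_s)
    return "compute-bound" if intensity >= balance else "memory-bound"


if __name__ == "__main__":
    print(f"machine balance:      {machine_balance(PEAK_FLOPS, HBM_BYTES_PER_S):.1f} FLOPs/byte")
    print(f"arithmetic intensity: {arithmetic_intensity(FLOPS_DONE, BYTES_MOVED):.1f} FLOPs/byte")
    print(classify(FLOPS_DONE, BYTES_MOVED, PEAK_FLOPS, HBM_BYTES_PER_S))
```

With these inputs the workload does roughly 3 FLOPs per byte while the hardware can sustain roughly 300, so the memory side is the limit by about two orders of magnitude.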