by rfoo 475 days ago
> Still looks compute-bound to me.

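H100 has 3.3TB/s of HBM bandwidth on paper and ~1000 TFLOPS of bf16 compute on paper. That's 1:300, i.e. roughly 300 FLOPs per byte moved before compute becomes the limit. 0.6GB of memory traffic vs ~2 GFLOPs of work is 1:3. Tell me, how is this compute-bound?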
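
To make the ratio argument concrete, here's a rough roofline-style check. It's only a sketch: the H100 figures are the on-paper peaks quoted above, the 0.6 GB / ~2 GFLOP workload numbers are the ones from this thread, and the helper names are made up for illustration.

```python
# Back-of-the-envelope roofline check using the numbers from this thread.
# All figures are on-paper peaks; real kernels reach less on both axes.

H100_HBM_BYTES_PER_S = 3.3e12    # ~3.3 TB/s HBM bandwidth (on paper)
H100_BF16_FLOPS_PER_S = 1000e12  # ~1000 TFLOPS bf16 (on paper)

WORKLOAD_BYTES = 0.6e9           # ~0.6 GB read per step (figure from the thread)
WORKLOAD_FLOPS = 2e9             # ~2 GFLOPs of work per step (figure from the thread)


def machine_balance(peak_flops_per_s: float, bytes_per_s: float) -> float:
    """FLOPs the chip can execute per byte it can load from memory."""
    return peak_flops_per_s / bytes_per_s


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs the workload performs per byte it has to move."""
    return flops / bytes_moved


balance = machine_balance(H100_BF16_FLOPS_PER_S, H100_HBM_BYTES_PER_S)
intensity = arithmetic_intensity(WORKLOAD_FLOPS, WORKLOAD_BYTES)

# A workload is memory-bound when its intensity is below the machine balance.
bound = "memory" if intensity < balance else "compute"

# Lower bounds on time spent on each resource at peak rates.
t_mem = WORKLOAD_BYTES / H100_HBM_BYTES_PER_S
t_compute = WORKLOAD_FLOPS / H100_BF16_FLOPS_PER_S

print(f"machine balance: {balance:.0f} FLOPs/byte")
print(f"workload intensity: {intensity:.1f} FLOPs/byte")
print(f"min memory time: {t_mem * 1e6:.0f} us, min compute time: {t_compute * 1e6:.0f} us")
print(f"{bound}-bound")
```

With these inputs the memory time dominates the compute time by nearly two orders of magnitude, which is the same 1:300 vs 1:3 gap stated in time units.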

(Also, your number, even after accounting for GQA, is still off. You usually can't store the KV cache in fp8.)