|
|
|
|
|
by menaerus
481 days ago
|
|
True. I wrote some software that does these calculations for me besides the ones I already have on the paper. I confused two different graphs. So, the final number would be ~0.6 GFLOPS (self-attention across heads) + ~0.15 GFLOPS (attention) + ~1 GFLOPS (ffwd) which in total give or take is ~2 GFLOPS per-layer. Bandwidth-wise, the ~1GB number I previously gave was also wrong (llama3-70B has 8 KV heads). Now, with more precise calculations that figure is ~0.6 GB per-layer. So, at batch_size=1, FP8 precision, 1024 tokens, during the decode phase with KV-cache, we need ~2GFLOPS of compute and ~0.6GB of bandwidth per each layer. Still looks compute-bound to me. |
|
H100 has 3.3TB/s HBM bandwidth on paper, and ~1000TFLOPS bf16 compute on paper. That's 1:300. 0.6GB vs ~2GFLOPS is 1:3. Tell me how is this compute bound?
(also, your number, even after accounting for GQA, is still off. You usually can't store kvcache in fp8.)