by menaerus
484 days ago
> This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does. Take llama3-8B as an example. The compute needed for self-attention per layer per token is roughly 0.15 GFLOPs. For simplicity, assume all weights are stored in FP8 precision; the data that has to be loaded from memory for the same work is then 0.05 GB. Store traffic is negligible. If we expand this to a 1k-token context, this becomes ~180 GFLOPs and ~0.35 GB per layer per 1k of context. Assuming our HW is an H100, is this compute-bound or memory-bound?
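
One way to frame the question is a roofline-style check: divide the FLOPs by the bytes loaded to get the arithmetic intensity, then compare it with the hardware's ridge point (peak compute divided by peak memory bandwidth). The sketch below only plugs in the figures quoted above. The H100 peaks (about 1979 TFLOPS dense FP8 and about 3.35 TB/s HBM bandwidth, SXM variant) are my assumptions taken from the public spec sheet, and the function names are made up for illustration.

```python
def arithmetic_intensity(gflop: float, gbytes: float) -> float:
    """FLOPs performed per byte loaded (the giga prefixes cancel out)."""
    return gflop / gbytes


def ridge_point(peak_tflops: float, peak_tb_per_s: float) -> float:
    """FLOP/byte above which a kernel can no longer be limited by bandwidth."""
    return peak_tflops / peak_tb_per_s


def classify(gflop: float, gbytes: float,
             peak_tflops: float, peak_tb_per_s: float) -> str:
    """Label a workload by comparing its intensity with the ridge point."""
    ai = arithmetic_intensity(gflop, gbytes)
    if ai >= ridge_point(peak_tflops, peak_tb_per_s):
        return "compute-bound"
    return "memory-bound"


# ASSUMPTION: approximate H100 SXM peaks, not measured numbers.
H100_FP8_TFLOPS = 1979.0   # dense FP8 tensor-core throughput
H100_HBM_TB_PER_S = 3.35   # HBM3 bandwidth

# Figures from the comment: (GFLOPs, GB loaded) per layer.
CASES = {
    "single token": (0.15, 0.05),
    "1k-token context": (180.0, 0.35),
}

if __name__ == "__main__":
    ridge = ridge_point(H100_FP8_TFLOPS, H100_HBM_TB_PER_S)
    print(f"ridge point: {ridge:.0f} FLOP/byte")
    for name, (gflop, gbytes) in CASES.items():
        ai = arithmetic_intensity(gflop, gbytes)
        verdict = classify(gflop, gbytes, H100_FP8_TFLOPS, H100_HBM_TB_PER_S)
        print(f"{name}: {ai:.0f} FLOP/byte -> {verdict}")
```

The verdict is sensitive to which peak you assume: swapping in a lower compute peak (for example a dense BF16 figure) lowers the ridge point, and real kernels rarely reach either peak, so treat the output as a rough guide rather than a measurement.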