by crashocaster 1084 days ago

Actually, the only numbers every LLM developer should know are their accelerator specs. For example:

A100 specs:

- 312e12 BF16 FLOPS

- 1555e9 B/s (1555 GB/s) HBM bandwidth

H100:

- 1000e12/2000e12 BF16/INT8 FLOPS

(apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)

- 3000 GB/s HBM bandwidth
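As a rough sketch of what that multiplier does to the H100 numbers, here is the derating spelled out. The variable names are made up for illustration, and applying the same ~0.7 factor to both BF16 and INT8 is an assumption, not something measured here.

```python
# Peak H100 figures as quoted above (FLOP/s).
H100_PEAK_FLOPS = {"bf16": 1000e12, "int8": 2000e12}

# Assumption: the same ~0.7 power-throttling factor applies to both dtypes.
THROTTLE_FACTOR = 0.7

def effective_flops(peak_flops, factor=THROTTLE_FACTOR):
    """Derate a peak FLOP/s figure by a sustained-efficiency factor."""
    return peak_flops * factor

for dtype, peak in H100_PEAK_FLOPS.items():
    print(f"{dtype}: ~{effective_flops(peak) / 1e12:.0f}e12 sustained FLOPS")
```

These derated figures, not the datasheet peaks, are what you would plug into the compute-bound estimates.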

---

For a 13B model on an A100, this nets:

13e9 * 2 bytes per param = 26 GB HBM required (at bf16)

26e9/1555e9 = 17ms / token small-batch latency (~60 tokens / second), since every token has to stream all the weights from HBM once
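The same arithmetic as a minimal Python sketch, assuming decode is purely HBM-bandwidth bound and every weight is read exactly once per token (no KV cache traffic, no overlap). The function names are illustrative only.

```python
PARAMS = 13e9           # model parameters
BYTES_PER_PARAM = 2     # bf16
A100_HBM_BW = 1555e9    # bytes/s, from the A100 specs above

def weight_bytes(params, bytes_per_param):
    """HBM footprint of the weights in bytes."""
    return params * bytes_per_param

def memory_bound_latency_s(params, bytes_per_param, hbm_bw):
    """Seconds per token if each token reads all weights from HBM once."""
    return weight_bytes(params, bytes_per_param) / hbm_bw

latency = memory_bound_latency_s(PARAMS, BYTES_PER_PARAM, A100_HBM_BW)
print(f"weights: {weight_bytes(PARAMS, BYTES_PER_PARAM) / 1e9:.0f} GB")
print(f"latency: {latency * 1e3:.1f} ms/token, {1 / latency:.0f} tokens/s")
```

Unrounded, this comes out at about 16.7 ms per token, which is where the 17 ms and ~60 tokens / second figures come from.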

What about large batches?

Compute-bound latency for some batch size B is 13e9 * 2 FLOP per param * B / 312e12

We want B such that we're just about no longer HBM bound: 26e9/312e12 * B = 17ms

<=> B = 17e-3/(26e9/312e12)

giving a batch size of 204.

At that batch size (and all larger batch sizes), the A100 delivers a throughput of B * 1/17ms = 12000 tokens / second
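A hedged sketch of the large-batch estimate, modelling a decode step as whichever is slower: streaming the weights from HBM or doing the matmul FLOPs. That max() model and the helper names are assumptions for illustration. The 204 above comes from rounding the latency to 17 ms; without rounding, the same formula gives a crossover of about 201.

```python
PARAMS = 13e9
BYTES_PER_PARAM = 2     # bf16
FLOP_PER_PARAM = 2      # one multiply + one add per parameter per token
A100_FLOPS = 312e12     # BF16 FLOP/s
A100_HBM_BW = 1555e9    # bytes/s

def step_latency_s(batch, flops=A100_FLOPS, hbm_bw=A100_HBM_BW):
    """Seconds per decode step: max of memory-bound and compute-bound time."""
    memory_time = PARAMS * BYTES_PER_PARAM / hbm_bw
    compute_time = PARAMS * FLOP_PER_PARAM * batch / flops
    return max(memory_time, compute_time)

def crossover_batch(flops=A100_FLOPS, hbm_bw=A100_HBM_BW):
    """Batch size at which compute time equals weight-streaming time."""
    return (BYTES_PER_PARAM * flops) / (FLOP_PER_PARAM * hbm_bw)

def throughput_tokens_per_s(batch):
    """Tokens per second across the whole batch."""
    return batch / step_latency_s(batch)

print(f"crossover batch: {crossover_batch():.0f}")
for b in (1, 64, 256, 1024):
    print(f"B={b}: {throughput_tokens_per_s(b):.0f} tokens/s")
```

Below the crossover, throughput scales linearly with B. Above it, throughput flattens at 312e12 / 26e9 = 12000 tokens / second, because each extra sequence adds as much compute time as it adds tokens.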

---

KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise for the reader :)