| Actually, the only numbers every LLM developer should know are their accelerator specs.
For example:

A100:
- 312e12 BF16 FLOPS
- 1555 GB/s HBM bandwidth (1555e9 bytes/s)

H100:
- 1000e12 / 2000e12 BF16 / INT8 FLOPS (apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)
- 3000 GB/s HBM bandwidth

---

For a 13B model on an A100, this nets:

13e9 * 2 bytes per param = 26 GB HBM required (at BF16)

26e9 / 1555e9 = 17 ms / token small-batch latency (~60 tokens / second)

What about large batches? The compute latency for some batch size B is

13e9 * 2 FLOP per param * B / 312e12

We want B such that we're just about no longer HBM-bound:

26e9 / 312e12 * B = 17 ms <=> B = 17e-3 / (26e9 / 312e12)

giving a batch size of 204. At that batch size (and all larger batch sizes), the A100 delivers a throughput of

B * 1 / 17 ms = 12000 tokens / second

---

KV caching, multi-GPU and multi-node comms, and matmul efficiencies are left as an exercise to the reader :) |
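
The arithmetic above is easy to redo for other models or accelerators, so here is a minimal Python sketch of it. The names `Accelerator` and `decode_estimates` are made up for illustration, and it assumes the same simplifications as the comment: weights only, 2 bytes and 2 FLOP per parameter per token, no KV cache, no comms, perfect matmul efficiency unless an `efficiency` multiplier is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical container for the two spec numbers used above."""
    name: str
    flops: float              # peak dense BF16 FLOP/s
    hbm_bandwidth: float      # HBM bandwidth in bytes/s
    efficiency: float = 1.0   # multiplier on peak FLOP/s (e.g. ~0.7 for H100)


def decode_estimates(n_params, acc, bytes_per_param=2.0, flop_per_param=2.0):
    """Back-of-envelope decode numbers for a dense model on one accelerator."""
    weight_bytes = n_params * bytes_per_param
    usable_flops = acc.flops * acc.efficiency

    # Small batch: every token requires streaming all weights from HBM once.
    memory_time_s = weight_bytes / acc.hbm_bandwidth

    # Compute time contributed by each sequence in the batch.
    compute_time_per_seq_s = n_params * flop_per_param / usable_flops

    # Batch size at which compute time catches up with the HBM read time.
    crossover_batch = memory_time_s / compute_time_per_seq_s

    return {
        "weight_gb": weight_bytes / 1e9,
        "latency_ms": memory_time_s * 1e3,
        "small_batch_tokens_per_s": 1.0 / memory_time_s,
        "crossover_batch": crossover_batch,
        # Compute-bound throughput at (and beyond) the crossover batch size.
        "peak_tokens_per_s": crossover_batch / memory_time_s,
    }


# Spec numbers taken from the comment above.
A100 = Accelerator("A100", flops=312e12, hbm_bandwidth=1555e9)
H100 = Accelerator("H100", flops=1000e12, hbm_bandwidth=3000e9, efficiency=0.7)

for acc in (A100, H100):
    est = decode_estimates(13e9, acc)
    print(acc.name, {k: round(v, 2) for k, v in est.items()})
```

Without rounding the latency to 17 ms first, the A100 crossover batch comes out near 201 rather than 204; the 12000 tokens / second figure is unchanged.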