| HN Mirror



	by joshjob42 54 days ago
	Technically true, but if we're talking about local models, you're overwhelmingly gonna be bandwidth bound. You need about 2 flops per active parameter per token, and every active weight has to be read from memory once to produce that token. An M5 chip has what, 150-200GB/s of memory bandwidth? But it can easily do something like 16 TFLOPS of fp16, so you're talking like 100 flops per byte of bandwidth. Which is just to say that in a batch=1 scenario, i.e. one user, you're only gonna use a few % of the GPU's compute while your memory bandwidth is totally saturated. For all practical purposes at the consumer level, take your memory bandwidth, divide by the size of the model in bytes (the active weights, if it's MoE), and that gives you the max tok/s throughput you're gonna get. Even a 5090 has something like 50-60 flops per byte of bandwidth; you just can't saturate the compute without running large batches. (At least during decode; prefill is obviously more compute bound.)
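To put rough numbers on the rule of thumb above, here's a back-of-the-envelope sketch in Python. The hardware figures are just the ballpark guesses from the comment (150 GB/s, 16 TFLOPS fp16), and the 8B-parameter fp16 model is a made-up example, not a measurement of anything. The function names are mine, and the estimate ignores KV cache reads, activations, and any overhead, so treat the result as an upper bound.

```python
# Roofline-style estimate for batch=1 decode.
# All hardware numbers below are assumed ballpark figures, not benchmarks.

FLOPS_PER_PARAM_PER_TOKEN = 2  # one multiply + one add per active weight


def flops_per_byte(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Compute the chip can do in the time it takes to read one byte."""
    return peak_flops / bandwidth_bytes_per_s


def max_decode_tok_per_s(bandwidth_bytes_per_s: float,
                         active_params: float,
                         bytes_per_param: float) -> float:
    """Upper bound on tok/s: every active weight is read once per token."""
    model_bytes = active_params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


def compute_utilization(peak_flops: float,
                        bandwidth_bytes_per_s: float,
                        active_params: float,
                        bytes_per_param: float) -> float:
    """Fraction of peak compute used when decode is bandwidth bound."""
    tok_s = max_decode_tok_per_s(bandwidth_bytes_per_s, active_params,
                                 bytes_per_param)
    flops_needed = tok_s * FLOPS_PER_PARAM_PER_TOKEN * active_params
    return flops_needed / peak_flops


if __name__ == "__main__":
    bandwidth = 150e9    # bytes/s, assumed low end of the range in the comment
    peak = 16e12         # fp16 flops/s, assumed
    params = 8e9         # hypothetical 8B dense model
    bytes_per_param = 2  # fp16 weights

    print(f"flops per byte: {flops_per_byte(peak, bandwidth):.0f}")
    print(f"max tok/s:      "
          f"{max_decode_tok_per_s(bandwidth, params, bytes_per_param):.1f}")
    print(f"compute used:   "
          f"{compute_utilization(peak, bandwidth, params, bytes_per_param):.1%}")
```

Note the utilization works out to (2 / bytes per param) divided by the chip's flops per byte, so it doesn't depend on model size at all. Quantizing to fewer bytes per weight raises both tok/s and utilization, but at batch=1 you're still nowhere near the compute ceiling.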