by dannyw 3 hours ago

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS = your memory bandwidth in GB/s / active weights in GB.

That’s it for decode. That’s all.
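
To put numbers on it, here's a rough back-of-the-envelope sketch in Python. The function names and the example figures (70B active params at 1 byte each, 1000 GB/s of bandwidth) are made up for illustration, not measurements. It assumes batch size 1 and that every active weight is read once per token, and it ignores KV cache reads, so treat the result as a ceiling rather than a prediction.

```python
def active_weights_gb(active_params_billions: float, bytes_per_param: float) -> float:
    """Size of the weights read per token, in GB.

    1 billion params at 1 byte each is 1 GB, so this is a plain product.
    For a mixture-of-experts model, pass only the params active per token.
    """
    return active_params_billions * bytes_per_param


def decode_tps_upper_bound(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth-bound ceiling on decode tokens per second.

    Assumes every active weight is streamed from memory once per token.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    # Hypothetical example: 70B active params, 8-bit weights, 1000 GB/s memory.
    weights = active_weights_gb(70, 1.0)
    print(f"{decode_tps_upper_bound(weights, 1000.0):.1f} tok/s")  # 14.3 tok/s
```

Halve the bytes per param (say, 4-bit instead of 8-bit) and the ceiling doubles, which is why quantization helps decode speed so much.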