Y
Hacker News
new
|
ask
|
show
|
jobs
by
dannyw
3 hours ago
Correct. The main bottleneck with LLM inference is, and have always been, memory bandwidth.
TPS = active weights in GB / your memory bandwidth.
That’s it for decode. That’s all.