| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by VierScar 314 days ago
	No I don't think it's the bits. I would say it's the computation. Inference requires performing a lot of matmul, and with more tokens the number of computation operations increases exponentially - O(n^2) at least. So increasing your context/conversation will quickly degrade performance I seriously doubt it's the throughput of memory during inference that's the bottleneck here.

3 comments

MereInterest 314 days ago

Nitpick: O(n^2) is quadratic, not exponential. For it to “increase exponentially”, n would need to be in the exponent, such as O(2^n).

link

esafak 314 days ago

To contrast with exponential, the term is power law.

link

zozbot234 314 days ago

Typically, the token generation phase is memory-bound for LLM inference in general, and this becomes especially clear as context length increases (since the model's parameters are a fixed quantity.) If it was pure compute bound there would be huge gains to be had by shifting some of the load to the NPU (ANE) but AIUI it's just not so.

link

summarity 314 days ago

It literally is. LLM inference is almost entirely memory bound. In fact for naive inference (no batching), you can calculate the token throughput just based on the model size, context size and memory bandwidth.

link

zozbot234 314 days ago

Prompt pre-processing (before the first token is output) is raw compute-bound. That's why it would be nice if we could direct llama.cpp/ollama to run that phase only on iGPU/NPU (for systems without a separate dGPU, obviously) and shift the whole thing over to CPU inference for the latter token-generation phase.

(A memory-bound workload like token gen wouldn't usually run into the CPU's thermal or power limits, so there would be little or no gain from offloading work to the iGPU/NPU in that phase.)

link