


	by menaerus 517 days ago
	This obviously depends on the hardware and the shape of the LLM itself but, generally speaking, it's quite the opposite. The idea of batching is to grow the compute bandwidth per single request; bigger batch sizes with much more compute will put more stress on the underlying memory subsystem (cache, RAM), no? For N self-attention layers, there will be N compute (tensor) units doing the computation in parallel. To retire the computation, each compute unit will need to LOAD/STORE from and to the chip memory. At batch size B, this only grows in scale, e.g. B * (N, LOAD/STORE).


lostmsu 517 days ago

If you have a batch of size 1, for every token you need to load the entire model from memory into cache as you go through it. If the batch size is 32, you can produce 32 tokens while doing the same amount of loading from VRAM.


menaerus 516 days ago

That's not how it works: if what you're saying were true, the self-attention memory complexity would be O(1), i.e. independent of the batch size. This obviously isn't the case, since each batch computation requires its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.


lostmsu 516 days ago

This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

But assuming your KV cache size is << model size, that simplification is pretty accurate.

See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

You can just scroll to the first chart they have that explains the idea.
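
To put rough numbers on the simplification, here is a back-of-the-envelope sketch. It assumes one decode step streams the weights from VRAM once for the whole batch, while each request reads its own KV cache. The function name and the sizes (a 7B-parameter model at 2 bytes per parameter, 0.5 GB of KV cache per request) are made up for illustration, not measurements.

```python
def bytes_read_per_token(weight_bytes: float,
                         kv_bytes_per_request: float,
                         batch_size: int) -> float:
    """Approximate VRAM bytes read per generated token in one decode step."""
    # Weights are read once per step and shared by every request in the batch.
    # Each request reads its own KV cache, so that term scales with batch size.
    step_bytes = weight_bytes + batch_size * kv_bytes_per_request
    # One step produces one token per request, i.e. batch_size tokens.
    return step_bytes / batch_size


# Hypothetical sizes, chosen only to illustrate the trend.
WEIGHT_BYTES = 14e9  # 7B parameters * 2 bytes
KV_BYTES = 0.5e9     # KV cache per request

for b in (1, 8, 32):
    gb = bytes_read_per_token(WEIGHT_BYTES, KV_BYTES, b) / 1e9
    print(f"batch={b:>2}: {gb:.2f} GB read per token")
```

The weight term shrinks as 1/B while the KV term stays constant per token, so the amortization argument holds as long as the KV cache is small relative to the model, and flattens out once the KV reads dominate.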
