| HN Mirror


by thewataccount 1099 days ago

This is really cool to see.

> Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.

> Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.

This mentions throughput improvements, which is great, and it also mentions memory savings. I'm a bit confused, though: how could 80% of the memory be wasted by the KV cache when the vast majority of the memory is usually holding the model itself?

How much memory does this effectively save for, say, a 30B 4-bit model?
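For reference, the quoted 1.7GB figure is consistent with a back-of-the-envelope calculation. This is only a sketch: it assumes the published LLaMA-13B shape (40 layers, hidden size 5120), fp16 keys and values, and a full 2048-token context. The helper name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_value=2):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor of 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * bytes_per_value * seq_len

# Assumed LLaMA-13B shape: 40 layers, hidden size 5120, fp16 (2 bytes).
per_token = kv_cache_bytes(40, 5120, seq_len=1)
full_seq = kv_cache_bytes(40, 5120, seq_len=2048)

print(f"{per_token / 1e3:.0f} KB per token")         # 819 KB per token
print(f"{full_seq / 1e9:.2f} GB for 2048 tokens")    # 1.68 GB for 2048 tokens
```

So the cache grows by roughly 0.8 MB per token, and only a sequence that fills the whole context reaches the "up to 1.7GB" figure.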

1 comment

zhisbug 1099 days ago

This really depends on what GPUs you use. If your GPUs have a very small amount of memory, vLLM will help more.

vLLM addresses the memory bottleneck of storing KV caches and hence increases throughput.
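To make the GPU dependence concrete, here is a rough sketch for a 30B 4-bit model. Everything numeric is an assumption, not something from this thread: a hypothetical 24 GB card, the LLaMA-30B shape (60 layers, hidden size 6656, about 32.5B parameters), an fp16 KV cache even though the weights are 4-bit, and a made-up average sequence length of 512 tokens. It also ignores activations and framework overhead, and it models "over-reservation" simply as reserving the full 2048-token context for every sequence.

```python
GB = 1e9

def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    # Keys and values (factor of 2) for every layer, for a single token.
    return 2 * num_layers * hidden_size * bytes_per_value

# All numbers below are illustrative assumptions.
gpu_bytes = 24 * GB                    # hypothetical 24 GB GPU
weight_bytes = 32.5e9 * 0.5            # ~32.5B params at 4 bits (0.5 byte) each
kv_budget = gpu_bytes - weight_bytes   # what is left for the KV cache

per_token = kv_bytes_per_token(60, 6656)   # assumed LLaMA-30B shape, fp16 cache
max_len = 2048                             # context length reserved up front
avg_len = 512                              # hypothetical actual average length

# Reserve the full context per sequence vs. allocate only what is used.
reserved_seqs = int(kv_budget // (per_token * max_len))
packed_seqs = int(kv_budget // (per_token * avg_len))
wasted_fraction = 1 - avg_len / max_len

print(f"KV budget: {kv_budget / GB:.2f} GB")
print(f"Sequences with full reservation: {reserved_seqs}")
print(f"Sequences when allocating on demand: {packed_seqs}")
print(f"Reserved-but-unused share of the cache: {wasted_fraction:.0%}")
```

Under these assumptions the weights take about 16 GB, leaving under 8 GB for the cache. The wasted share is a fraction of that remaining KV-cache memory, not of the whole GPU, so the gain shows up as more sequences batched at once rather than as a smaller model footprint. The less memory is left after the weights, the more that matters.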
