by thewataccount
1099 days ago
This is really cool to see.

> Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.

> Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.

This mentions improvements for throughput which is great, and it mentions memory savings. I'm a bit confused how 80% of the memory could be wasted by the KV cache when the vast majority of the memory is usually holding the model itself?

How much memory savings does this translate to effectively for say a 30B 4bit model?
vLLM addresses the memory bottleneck of storing KV caches, which in turn increases throughput.
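For a rough sense of where the 1.7GB figure comes from, here is a back-of-the-envelope sketch. It assumes standard multi-head attention, fp16 (2 bytes per element), and the LLaMA-13B shape of 40 layers with hidden size 5120. The function names are made up for illustration, and the 400-token request is a hypothetical example, not a measurement. It also reads the 60% – 80% figure as applying to the memory set aside for the KV cache, not to total GPU memory including weights.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence.

    Assumes standard multi-head attention: each layer stores one key
    vector and one value vector of size hidden_size per token.
    """
    return 2 * num_layers * hidden_size * seq_len * bytes_per_elem


def reserved_waste_fraction(used_tokens, reserved_tokens):
    """Fraction of a contiguous per-request reservation that goes unused."""
    return 1 - used_tokens / reserved_tokens


# LLaMA-13B shape (assumed: 40 layers, hidden size 5120, fp16).
LAYERS, HIDDEN = 40, 5120

per_token = kv_cache_bytes(LAYERS, HIDDEN, seq_len=1)
full_seq = kv_cache_bytes(LAYERS, HIDDEN, seq_len=2048)

print(f"{per_token / 1024:.0f} KiB per token")          # 800 KiB per token
print(f"{full_seq / 1e9:.2f} GB for a 2048-token seq")  # 1.68 GB

# Hypothetical request: the server reserves room for 2048 tokens up front,
# but the sequence ends after 400 tokens.
print(f"{reserved_waste_fraction(400, 2048):.0%} of the reservation unused")
```

So the waste is relative to the KV cache pool: if a server reserves a max-length slot per request and most requests finish early, most of that pool sits idle even though the weights are untouched. For a quantized model the weights shrink, but the per-token cache cost depends on the layer count, hidden size, and cache dtype, so the cache becomes a larger share of what is left.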