by george_123 1098 days ago
this approach to managing the KV cache can work with 4-bit. Imagine the speedup of PagedAttention combined with quantization.

yep, it is agonistic to 4-bit. You can deploy a 4-bit model and still use vLLM + PagedAttention to double or even triple your serving throughput.
If this were submitted as a new comment, it would be at the top of the page.
You mean like, theoretically, in the future? Or you mean today?
probably mean agnostic; agonistic implies the opposite.
oops typo
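A toy Python sketch of why paging is independent of weight precision: PagedAttention manages the KV cache in fixed-size blocks tracked by a per-sequence block table, and quantizing the weights to 4-bit only changes how much GPU memory is left over for those blocks. All names here (kv_blocks_available, BlockAllocator, Sequence) are hypothetical and are not vLLM's API. The memory model is deliberately crude and ignores activations and other runtime overheads.

```python
"""Toy sketch of PagedAttention-style KV-cache block management.

Hypothetical illustration only: these names are made up, not vLLM's API.
"""
from dataclasses import dataclass, field


def kv_blocks_available(gpu_bytes: int, n_params: int, weight_bits: int,
                        kv_bytes_per_token: int, block_size: int) -> int:
    """Number of fixed-size KV blocks that fit after the weights are loaded.

    Weight precision only changes how much memory is left over; the block
    bookkeeping below is identical for 16-bit and 4-bit weights.
    Assumes nothing else (activations, workspace) consumes GPU memory.
    """
    weight_bytes = n_params * weight_bits // 8
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (kv_bytes_per_token * block_size)


class BlockAllocator:
    """Hands out physical KV-cache block ids from a free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Per-sequence block table: logical block i -> physical block id."""
    block_size: int
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is allocated only when the last one is full,
        # so each sequence wastes at most block_size - 1 token slots.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def free(self, allocator: BlockAllocator) -> None:
        # Return every physical block to the shared pool.
        for block_id in self.block_table:
            allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0
```

With assumed numbers (a 24 GiB GPU, 7B parameters, 512 KiB of fp16 KV per token, 16-token blocks), kv_blocks_available returns 1403 blocks for 16-bit weights and 2654 for 4-bit weights, roughly 1.9x the room for concurrent sequences. That is a capacity estimate under those assumptions, not a throughput measurement.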