by bioemerl 1098 days ago
I'm spoiled by 4-bit, and unfortunately it doesn't appear to be supported here, so this isn't of much use to me, but it's awesome to see people working on the inference speed side of things regardless.

This approach to managing the KV cache can work with 4-bit. Imagine the speedup of PagedAttention with quantization.
yep, it is agonistic to 4-bit. You can deploy a 4-bit model and still use vllm + pagedattention to double or even triple your serving throughput.
If this were submitted as a new comment it would be at the top of the page.
You mean like, theoretically, in the future? Or you mean today?
probably mean agnostic, agonistic implies the opposite.
oops typo