Hacker News
by brucethemoose2 1099 days ago
Reading between the lines, it sounds like some of the speedup comes from VRAM savings on an otherwise close to full GPU?

This is definitely cool and needed, but the gain might not be so dramatic when running a 3-5 bit quant on a less full GPU.

1 comments

Yes, vLLM focuses on maximizing throughput when the VRAM is fully utilized. Nevertheless, I believe users can still benefit from vLLM even if they don't use the memory to its full capacity, because vLLM also includes other optimizations orthogonal to PagedAttention (e.g., optimized CUDA kernels).
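
To make the "nearly full GPU" point concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not vLLM's actual allocator or scheduler: the function names, the token budget, the block size, and the sequence lengths are all assumptions picked for illustration. It compares how many requests fit in a fixed KV-cache budget when each request reserves space for the maximum length up front versus when space is handed out in small fixed-size blocks as needed.

```python
import math


def seqs_with_contiguous_reservation(budget_tokens: int, max_len: int) -> int:
    """Each request reserves max_len token slots up front, whatever its real length."""
    return budget_tokens // max_len


def seqs_with_paged_blocks(budget_tokens: int, seq_lens: list[int], block_size: int) -> int:
    """Each request only takes as many fixed-size blocks as its real length needs."""
    total_blocks = budget_tokens // block_size
    used_blocks = 0
    admitted = 0
    for n in seq_lens:
        need = math.ceil(n / block_size)  # only the last block can be partly empty
        if used_blocks + need > total_blocks:
            break
        used_blocks += need
        admitted += 1
    return admitted


# Hypothetical numbers: room for 16384 cached tokens, 2048-token max length,
# 100 queued requests that each end up using about 300 tokens.
budget = 16384
contiguous = seqs_with_contiguous_reservation(budget, max_len=2048)
paged = seqs_with_paged_blocks(budget, seq_lens=[300] * 100, block_size=16)
print(contiguous, paged)
```

With these made-up numbers the block-based scheme admits several times more concurrent requests. If only a handful of requests are queued, both schemes fit all of them, which is the case where the memory-side gain shrinks and any remaining speedup has to come from elsewhere (such as faster kernels).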