by wskwon 1099 days ago
Yes, vLLM focuses on maximizing throughput when VRAM is fully utilized. Nevertheless, I believe users can still benefit from vLLM even if they don't use the memory to its full capacity, because vLLM also includes other optimizations that are orthogonal to PagedAttention (e.g., optimized CUDA kernels).