	by wskwon 1099 days ago
	Yes, vLLM focuses on maximizing throughput when VRAM is fully utilized. Still, I believe users can benefit from vLLM even when they don't use the memory to its full capacity, because vLLM also includes other optimizations that are orthogonal to PagedAttention (e.g., optimized CUDA kernels).