


	by ahzhou 501 days ago
	It’s a tensor stored in GPU memory to improve inference throughput. For how most systems implement it nowadays, check out the PagedAttention paper, which introduced vLLM.
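
	Rough sketch of the paging idea, assuming "it" here is the KV cache: memory is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so a sequence doesn't need one contiguous allocation. The class and method names below are made up for illustration and are not vLLM's API. It only tracks block ids, not actual tensors.

```python
class PagedKVCache:
    """Toy block table (hypothetical, not vLLM's implementation)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                 # tokens per physical block
        self.free_blocks = list(range(num_blocks))   # physical block ids not in use
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # No block yet, or the last block is full: grab a new one on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Usage: three tokens with block_size=2 span two physical blocks.
cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("seq-a") for _ in range(3)]
cache.free("seq-a")
```

	Since blocks are allocated only when a sequence actually grows into them, the wasted space per sequence is at most one partially filled block.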