by ahzhou 501 days ago
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (which also introduces vLLM) for how most systems implement it nowadays.
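
To make the paging idea concrete, here's a rough toy sketch. It assumes "it" above is the attention key/value (KV) cache, which is what PagedAttention manages. The idea shown is that the cache is split into fixed-size blocks, and each sequence keeps a block table mapping its tokens to whichever physical blocks it was handed. Every name here (`PagedKVCache`, `BLOCK_SIZE`, `append`, `read`, `release`) is made up for illustration and is not vLLM's API; plain Python lists stand in for what would be GPU tensors.

```python
# Toy sketch of paged KV-cache bookkeeping. All names are hypothetical;
# this is not vLLM's API, and real systems keep the blocks in GPU memory.

BLOCK_SIZE = 4  # tokens per physical block (assumed value for the example)


class PagedKVCache:
    def __init__(self, num_blocks):
        # Each physical block holds up to BLOCK_SIZE (key, value) entries.
        self.blocks = [[] for _ in range(num_blocks)]
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append(self, seq_id, key, value):
        """Store the key/value of one newly generated token."""
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new block only when the sequence's last block is full.
        if not table or len(self.blocks[table[-1]]) == BLOCK_SIZE:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.blocks[table[-1]].append((key, value))

    def read(self, seq_id):
        """Return all cached (key, value) pairs for a sequence, in token order."""
        table = self.block_tables.get(seq_id, [])
        return [kv for block_id in table for kv in self.blocks[block_id]]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        for block_id in self.block_tables.pop(seq_id, []):
            self.blocks[block_id].clear()
            self.free.append(block_id)
```

In this sketch a sequence only occupies as many blocks as it has tokens for (rounded up to the block size), and its blocks don't need to be contiguous, so finished sequences can hand memory back to others right away.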