by ahzhou 501 days ago
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (which introduced vLLM) for how most systems implement it nowadays.
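For anyone who wants to see the idea in code, here is a minimal sketch. It assumes the thing being discussed is the attention key/value (KV) cache, which is what the PagedAttention paper manages. The names `KVCache`, `attend`, and `decode_step` are made up for illustration and are not vLLM's API. Plain Python lists stand in for the GPU tensor, and there is a single attention head with no paging.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(dim)
    ]


class KVCache:
    """Hypothetical cache: one key vector and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(cache, q, k, v):
    """Process one new token: store its key/value, then attend over the cache.

    Without the cache, every step would have to recompute keys and values
    for all earlier tokens; with it, each step only adds one new pair.
    """
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


if __name__ == "__main__":
    cache = KVCache()
    print(decode_step(cache, [1.0, 0.0], [1.0, 0.0], [3.0, 4.0]))
    print(decode_step(cache, [0.0, 0.0], [0.0, 1.0], [5.0, 8.0]))
```

The cache grows by one entry per generated token, which is why memory becomes the bottleneck for long sequences and large batches. As I understand the paper, PagedAttention addresses that by storing the cache in fixed-size blocks rather than one contiguous buffer per sequence; that part is not shown here.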