Hacker News new | ask | show | jobs
by yk 760 days ago
> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) how inference looks at scale? (Kinda like ELI-using-single-gpus)

1 comments

The PagedAttention paper is a good starting point as it represents the first major open source inference engine that had "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180