HN Mirror

	by yk 806 days ago
	> a significant amount of memory goes into the KV cache

	Is there a good paper (or talk) on what inference looks like at scale? (Kinda like ELI-using-single-gpus)

1 comment

AaronFriel 805 days ago

The PagedAttention paper is a good starting point: it describes vLLM, the first major open source inference engine with "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180
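
For a rough sense of why the KV cache takes so much memory, here is a back-of-the-envelope sketch. The model dimensions, batch size, and sequence length are illustrative assumptions (roughly the scale of a 13B-parameter model with standard multi-head attention in fp16), not numbers from this thread, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed for one token of one sequence."""
    # The leading 2 covers one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(batch_size, seq_len, **model):
    """Bytes of KV cache for a batch where every sequence has seq_len tokens."""
    return batch_size * seq_len * kv_cache_bytes_per_token(**model)


# Assumed (hypothetical) model shape: 40 layers, 40 KV heads of size 128, fp16.
model = dict(num_layers=40, num_kv_heads=40, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes_per_token(**model)
total = kv_cache_bytes(batch_size=32, seq_len=2048, **model)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 2**30:.0f} GiB for 32 sequences of 2048 tokens")
```

Under these assumptions the cache grows linearly with both batch size and sequence length, which is why how that memory is allocated matters for batch throughput.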
