Hacker News
by AaronFriel 759 days ago
The PagedAttention paper is a good starting point: it describes vLLM, the first major open-source inference engine with "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180