by yorwba 232 days ago
In particular, part of the paper is about dynamically adjusting the number of tokens generated in parallel while maintaining roughly the same output quality as one-token-at-a-time decoding. The other part is about the KV caching strategy they use to speed up parallel decoding further.
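To make the first part concrete, here is a rough sketch of how the number of tokens committed per step could vary. It assumes a confidence-threshold rule, which is my guess at the kind of mechanism involved and not something taken from the paper; `select_parallel_tokens`, `confidences`, and `threshold` are all hypothetical names.

```python
def select_parallel_tokens(confidences, threshold=0.9):
    """Pick which candidate positions to commit in one decoding step.

    Assumption (not from the paper): each candidate position has a
    confidence score in [0, 1], and every position at or above the
    threshold is committed in the same step.
    """
    if not confidences:
        return []

    # Commit every position the model is confident about, in parallel.
    chosen = [i for i, c in enumerate(confidences) if c >= threshold]

    if not chosen:
        # Nothing clears the threshold: commit only the single most
        # confident position, i.e. fall back to one token per step.
        best = max(range(len(confidences)), key=lambda i: confidences[i])
        chosen = [best]

    return chosen


# Two positions clear the threshold, so two tokens are committed at once.
print(select_parallel_tokens([0.95, 0.50, 0.92]))
# No position clears it, so only the best one is committed.
print(select_parallel_tokens([0.30, 0.60, 0.40]))
```

With a rule like this, easy stretches of text get many tokens per step while uncertain ones degrade to ordinary sequential decoding, which is one way quality could stay close to the one-token-at-a-time baseline.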