by zozbot234 99 days ago
Additional compute is generally a win for prefill, while memory bandwidth is king for decode. The KV cache, however, is the main blocker for long context, so it should be offloaded to system RAM, and even to NVMe swap, as context grows. Yes, that's slow on an absolute basis, but it's faster (and more power efficient, which makes everything else faster) than not having the cache at all, i.e. recomputing keys and values for the whole context on every token, so it's still a huge win.
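To put rough numbers on this, here is a back-of-the-envelope sketch of how the KV cache grows with context and how long one full pass over it takes per memory tier. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the bandwidth figures are illustrative assumptions, not measurements of any particular model or machine, and the function names are made up for this example.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(n_tokens, **model):
    # The cache grows linearly with the number of tokens in context.
    return n_tokens * kv_bytes_per_token(**model)


def full_read_seconds(cache_bytes, bandwidth_bytes_per_s):
    # Each decode step attends over the whole cache, so reading it once
    # is a lower bound on per-token latency for that memory tier.
    return cache_bytes / bandwidth_bytes_per_s


# Hypothetical model shape (assumption, not a specific real model).
MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

# Assumed round-number bandwidths in bytes/s; real hardware varies widely.
TIERS = {"VRAM": 1000e9, "system RAM": 50e9, "NVMe": 5e9}

if __name__ == "__main__":
    for ctx in (8192, 32768, 131072):
        size = kv_cache_bytes(ctx, **MODEL)
        line = f"{ctx:>7} tokens: {size / 2**30:5.1f} GiB"
        for name, bw in TIERS.items():
            line += f" | {name}: {full_read_seconds(size, bw):.3f} s/read"
        print(line)
```

Under these assumptions the cache costs 128 KiB per token, so 131072 tokens of context need 16 GiB for the cache alone. The per-tier read times only bound the cost of streaming the cache once per decoded token; they say nothing about how that compares with recomputing it.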
1 comment

Well, if you do that, you reverse the strengths of your system. It might be best to work with the context length you can offload, like a normal person.