by saagarjha 4 days ago
Sure, but any classical attention mechanism is quadratic in context length.
1 comment

But text generation is still quadratic in total even with the KV cache optimization: each decode step is linear in the context so far, and there is one step per token. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
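
To put rough numbers on that, here's a back-of-the-envelope sketch. It only counts query-key dot products for a single attention head and ignores everything else (MLPs, multiple heads and layers, prefill, memory bandwidth). The function names are made up for illustration, and "without cache" assumes the full causal attention over the whole prefix is redone at every step.

```python
def attention_ops_with_cache(n_tokens: int) -> int:
    """Query-key dot products to generate n_tokens with a KV cache.

    At step t only the newest token's query is computed, and it attends
    to the t cached keys: t dot products per step, O(n^2) in total.
    """
    return sum(t for t in range(1, n_tokens + 1))


def attention_ops_without_cache(n_tokens: int) -> int:
    """Same count when K/V are recomputed from scratch at every step.

    At step t the full causal attention over t tokens is redone, which is
    t * (t + 1) / 2 dot products per step, O(n^3) in total.
    """
    return sum(t * (t + 1) // 2 for t in range(1, n_tokens + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        cached = attention_ops_with_cache(n)
        uncached = attention_ops_without_cache(n)
        print(f"n={n}: cached={cached}, uncached={uncached}, "
              f"ratio={uncached / cached:.1f}")
```

Under those assumptions the cached total is n(n+1)/2 and the recomputed total is n(n+1)(n+2)/6, so the gap grows linearly with the number of generated tokens.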