Hacker News
saagarjha
4 days ago
Sure, but any classical attention mechanism is quadratic in context length.
zozbot234
4 days ago
But text generation is quadratic *after* the KV cache optimization. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
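A rough sketch of the counting argument, not tied to any particular model or library. It counts query-key dot products per attention head and layer while decoding n tokens. The function names are made up for illustration, and the no-cache case assumes each step recomputes a full t x t score matrix (a causal mask would roughly halve that, but the growth rate stays the same).

```python
def attention_scores_with_cache(n: int) -> int:
    """Query-key dot products to decode n tokens when past keys/values are cached."""
    # At step t, one new query is scored against t keys (t - 1 cached plus its own).
    return sum(t for t in range(1, n + 1))


def attention_scores_without_cache(n: int) -> int:
    """Same count when every step recomputes attention over the whole prefix."""
    # Assumption: step t rebuilds the full t x t score matrix from scratch.
    return sum(t * t for t in range(1, n + 1))


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        cached = attention_scores_with_cache(n)
        uncached = attention_scores_without_cache(n)
        print(f"n={n}: with cache {cached}, without cache {uncached}")
```

Doubling n roughly quadruples the cached count (quadratic total) and multiplies the uncached count by about eight (cubic total), which is why recomputing the cache on every decode step is the worse case.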