HN Mirror



	by zozbot234 52 days ago
	The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.

3 comments

saagarjha 52 days ago

Sure, but any classical attention mechanism is quadratic in context length.


zozbot234 52 days ago

But with the KV cache optimization, text generation is only quadratic overall, since each decode step is linear in the context so far. If every decode step now has to recompute the KV cache, including its latest and most expensive tokens (even with a quick "draft" model), that's even worse.
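
To make the scaling argument concrete, here is a rough sketch that counts query-key score computations under causal attention. The function names and the counting model (one unit of work per query-key pair, ignoring layers, heads, and the MLP) are assumptions for illustration, not a measurement of any real system.

```python
def causal_scores(n):
    """Query-key pairs for a full causal pass over n tokens: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def cost_with_kv_cache(prompt_len, new_tokens):
    """Prefill once, then each new token only scores against cached keys."""
    prefill = causal_scores(prompt_len)
    # The t-th generated token sits at position prompt_len + t and attends
    # to that many positions (itself included): linear work per step.
    decode = sum(prompt_len + t for t in range(1, new_tokens + 1))
    return prefill + decode


def cost_full_recompute(prompt_len, new_tokens):
    """Prefill once, then redo the whole causal pass at every decode step."""
    prefill = causal_scores(prompt_len)
    # Each step repeats the quadratic pass over everything seen so far.
    decode = sum(causal_scores(prompt_len + t) for t in range(1, new_tokens + 1))
    return prefill + decode


if __name__ == "__main__":
    for prompt_len in (1_000, 10_000, 100_000):
        cached = cost_with_kv_cache(prompt_len, 100)
        recomputed = cost_full_recompute(prompt_len, 100)
        print(prompt_len, cached, recomputed, round(recomputed / cached, 1))
```

Under this counting model the cached total is the same as a single causal pass over the final sequence, while recomputing at every step adds one such pass per generated token. A cheaper draft model would shrink the constant on the recompute side but not change how it grows.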


zozbot234 52 days ago

BTW, I forgot to mention that you can make this work in a way, but only if your model architecture generalizes the context and attention mechanism so that the context is no longer a pure sequence. You could have a large number of distinct "early" token sequences, each self-contained and not depending on any other tokens; your source code files, for example, might be treated this way. Later parts of the context would of course still depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would previously have been accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.
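
One way to picture this layout is as an attention mask. The sketch below is only an illustration under assumed names (`build_segment_mask`, `segment_lens`, `tail_len` are hypothetical): each early segment is causal within itself and blind to the other segments, while the trailing part of the context attends causally to everything before it.

```python
import numpy as np


def build_segment_mask(segment_lens, tail_len):
    """Boolean mask where mask[i, j] is True if query i may attend to key j.

    segment_lens: lengths of the independent "early" segments (e.g. files).
    tail_len: number of later tokens that depend on all earlier context.
    """
    n = sum(segment_lens) + tail_len
    mask = np.zeros((n, n), dtype=bool)

    # Early segments: causal attention inside the segment, nothing outside it.
    start = 0
    for length in segment_lens:
        end = start + length
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        start = end

    # Tail rows: ordinary causal attention over the entire context so far.
    causal = np.tril(np.ones((n, n), dtype=bool))
    mask[start:, :] = causal[start:, :]
    return mask


if __name__ == "__main__":
    # Two "files" of 2 and 3 tokens, followed by 2 later tokens.
    print(build_segment_mask([2, 3], 2).astype(int))
```

Because no early segment's rows reference any other segment, the keys and values for one segment do not change when a different segment is edited, which is what would make them reusable. The blocked-out entries between segments are the lost dependencies mentioned above.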


somnial 51 days ago

True, but there's no reason the predictor model couldn't use linear attention (e.g. Mamba, GDN, etc.) to predict KV caches.
