HN Mirror



	by Kubuxu 974 days ago
	The KV cache grows linearly with context length, which can leave you tight on memory. There is also the added cost of recomputing the KV cache when the context window has to slide, though this is close to being solved by streaming LLMs.
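	To put rough numbers on the linear growth, here is a back-of-the-envelope sketch. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, fp16) are assumptions picked for illustration, not the configuration of any model discussed in this thread.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size for a decoder-style transformer.

    Each layer stores two tensors (K and V), each of shape
    [batch_size, n_kv_heads, seq_len, head_dim].
    """
    return (2 * n_layers * batch_size * n_kv_heads
            * seq_len * head_dim * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (2048, 4096, 8192):
    size = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"{seq_len:>5} tokens: {size / 2**30:.1f} GiB")
```

	With these assumed numbers the cache costs 0.5 MiB per token, so doubling the context doubles the memory. Models that share KV heads across query heads (fewer `n_kv_heads`) shrink this proportionally.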

1 comment

woadwarrior01 974 days ago

BERT-style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference: they process the whole input in a single forward pass instead of generating tokens one at a time. A KV cache is only needed for efficient inference with encoder-decoder and decoder-only (aka GPT) models.
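
To make the distinction concrete, here is a toy single-head, single-layer sketch in plain Python. It is not any library's implementation; the helper names (`attend`, `causal_full`, `bidirectional_full`) and the random toy vectors are made up for illustration, and the queries, keys, and values are treated as given rather than computed from a real model.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def causal_full(qs, ks, vs):
    """Decoder-style recomputation: position i attends to positions 0..i."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]


def bidirectional_full(qs, ks, vs):
    """Encoder-style: every position attends to every position, one pass."""
    return [attend(q, ks, vs) for q in qs]


rng = random.Random(0)
T, D = 5, 4  # toy sequence length and head dimension
qs, ks, vs = ([[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
              for _ in range(3))

# Decoder-style generation: one token per step, reusing cached keys/values.
k_cache, v_cache, cached_outputs = [], [], []
for t in range(T):
    k_cache.append(ks[t])  # only the new token's K/V are added each step
    v_cache.append(vs[t])
    cached_outputs.append(attend(qs[t], k_cache, v_cache))

# The cached loop reproduces full causal attention without recomputing it.
full_outputs = causal_full(qs, ks, vs)
matches = all(math.isclose(a, b)
              for row_a, row_b in zip(cached_outputs, full_outputs)
              for a, b in zip(row_a, row_b))
print("cached decoding matches full causal attention:", matches)

# Encoder-style inference: a single pass over the whole input, no loop.
encoder_outputs = bidirectional_full(qs, ks, vs)
print("encoder output rows:", len(encoder_outputs))
```

The cache only pays off inside the step-by-step generation loop, where earlier keys and values would otherwise be recomputed at every step. The encoder call has no such loop, so there is nothing to carry over between steps.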
