by woadwarrior01 971 days ago
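BERT-style encoder-only models, like the embedding model being discussed here, don't need a KV cache for inference. They process the whole input in a single forward pass, so there are no earlier decoding steps whose keys and values could be reused. A KV cache is only needed for efficient inference with models that generate tokens autoregressively, i.e. encoder-decoder and decoder-only (aka GPT) models.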
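To make the difference concrete, here is a rough toy sketch in pure Python, not how any real library implements it. It uses single-head attention with random weights, and the function names (`encode`, `decode_no_cache`, `decode_with_cache`) are made up for illustration. The decoder loops are fed a fixed token sequence rather than sampling each next token, which is a simplification. Each function also counts how many key/value projections it computes.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]

def encode(xs, Wq, Wk, Wv):
    """Encoder-only: one pass, every token attends to every token."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    outs = [attend(matvec(Wq, x), keys, values) for x in xs]
    return outs, 2 * len(xs)  # K and V computed once per token, nothing to cache

def decode_no_cache(xs, Wq, Wk, Wv):
    """Decoder without a cache: re-project the whole prefix at every step."""
    outs, n_proj = [], 0
    for t in range(len(xs)):
        prefix = xs[: t + 1]
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        n_proj += 2 * len(prefix)
        outs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outs, n_proj

def decode_with_cache(xs, Wq, Wk, Wv):
    """Decoder with a KV cache: project only the newest token at each step."""
    outs, n_proj = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        n_proj += 2
        outs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outs, n_proj

random.seed(0)
d, n = 4, 5  # toy sizes: model dimension and sequence length

def rand_matrix():
    return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

_, enc_proj = encode(xs, Wq, Wk, Wv)
slow, slow_proj = decode_no_cache(xs, Wq, Wk, Wv)
fast, fast_proj = decode_with_cache(xs, Wq, Wk, Wv)
print(enc_proj, slow_proj, fast_proj)  # 10 30 10
```

Both decoder variants produce the same outputs, but without the cache the number of K/V projections grows quadratically with sequence length. The encoder does its 2n projections once and is done, which is why a cache buys it nothing.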