HN Mirror

	by pythongiant 31 days ago
	KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured 47.9x TTFT speedup on an 8-turn conversation, 21x on code context reuse, 100–743x faster than MLX, and 3–41x faster than vLLM-MLX — including interior chunk reuse where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. Works best on 3B+ models with 500+ token shared context. GitHub: https://github.com/pythongiant/KVBoost
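
	For readers unsure what "interior chunk reuse" means here, a toy sketch may help. It is not KVBoost's code or API; the chunk size, the function names, and the hashing scheme are all assumptions made for illustration. It contrasts keying a cache by each chunk's own tokens with prefix-chained keys, where a chunk's key also depends on everything before it.

```python
import hashlib

CHUNK = 4  # tokens per chunk (hypothetical; real systems use larger chunks)

def chunks(tokens, size=CHUNK):
    """Split a token list into fixed-size chunks, dropping any short tail."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]

def content_keys(tokens):
    """Key each chunk by its own tokens only, independent of position."""
    return [hashlib.sha256(repr(c).encode()).hexdigest() for c in chunks(tokens)]

def prefix_keys(tokens):
    """Key each chunk by its tokens plus the key of the preceding chunk."""
    keys, prev = [], ""
    for c in chunks(tokens):
        prev = hashlib.sha256((prev + repr(c)).encode()).hexdigest()
        keys.append(prev)
    return keys

def hits(cached, query, key_fn):
    """Count chunks of `query` whose key already exists for `cached`."""
    store = set(key_fn(cached))
    return sum(k in store for k in key_fn(query))

shared = list(range(100, 108))         # 8 tokens = 2 chunks of shared context
cached_prompt = [1, 2, 3, 4] + shared  # shared context sits after one prefix
query_prompt = [9, 9, 9, 9] + shared   # same context after a different prefix

content_hits = hits(cached_prompt, query_prompt, content_keys)
prefix_hits = hits(cached_prompt, query_prompt, prefix_keys)
print(content_hits, prefix_hits)
```

	With content-only keys the two shared chunks are found even though the first chunk differs; with prefix-chained keys nothing after the differing chunk matches. The catch is that KV entries computed under a different prefix are not exact for the new one, which is presumably why the post pairs chunk reuse with recompute strategies.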

3 comments

Even the things that should be normal dashes are em-dashes

En-dashes are not em-dashes, and they're standard typography for numeric ranges.

I don't get it. The output of the CacheBlend paper is in LMCache. Did you compare against vLLM with LMCache? This is confusing.

slop