Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT


	Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT (pythongiant.github.io)
	20 points by pythongiant 30 days ago

7 comments

hexnuts 30 days ago

Bad site design. If I can't scroll to see the next slide, that's just broken.


pythongiant 30 days ago

Makes sense, fixing that. Thanks!


pythongiant 30 days ago

Here's the repository in case anyone wants to have a look at the code. Leave a star if you find it interesting :P https://github.com/pythongiant/KVBoost


stpedgwdgfhgdd 30 days ago

I just don't get why people choose Python and not, e.g., Go for high-performance problems.


Yoric 30 days ago

Go is pretty good at performance, but pretty bad at expressing domain-specific logic. Python is the opposite, but once you have isolated the parts that need to be optimized, it's quite easy to rewrite them in a native language (in particular, the Rust-Python bindings are really good, although this project uses C++).


sigmoid10 30 days ago

Python is a very convenient skeleton for gluing together high-performance modules written in C or CUDA. Writing the boilerplate needed to adapt those modules to your project directly in C or CUDA is much more inconvenient.


larme 30 days ago

Go is not high-performance enough. As others said, you implement the performance-critical parts in C++ and use Python to glue them together.


pythongiant 30 days ago

My initial choice was actually to use Rust for this (probably should have, too :P), but I went with Python for an initial MVP/skeleton ahead of a future rewrite.


x0ruman 30 days ago

The functionality is impressive, but the website needs some work.


pythongiant 30 days ago

Thanks! This is a weekend project that I'm working on on the side, just to learn more about ML engineering and custom CUDA kernels. I didn't think much about the website.


npodbielski 30 days ago

Drop-in replacement for what exactly? Can I use it with llama.cpp and Vulkan? Or vLLM and ROCm?


pythongiant 30 days ago

KVBoost is a drop-in replacement for AutoModelForCausalLM. Same API surface (KVBoost.from_pretrained(...), engine.generate(...)), but with cross-request KV reuse, FlashAttention-2, AWQ layer streaming, and speculative decoding bolted on.


sakex 30 days ago

Is this based on paged attention with hashing of the pages?


pythongiant 30 days ago

KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured 47.9x TTFT speedup on an 8-turn conversation, 21x on code context reuse, 100–743x faster than MLX, and 3–41x faster than vLLM-MLX — including interior chunk reuse where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. Works best on 3B+ models with 500+ token shared context. GitHub: https://github.com/pythongiant/KVBoost
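To make the "interior chunk reuse" point concrete, here is a minimal sketch of how the choice of cache key decides which chunks can be reused. It is not KVBoost's code and does not use its API: the chunk size, the function names, and both hashing schemes are assumptions for illustration. A prefix-chained key (each chunk's hash folds in the hash of everything before it) only matches when the whole preceding prefix is identical, while a content-only key matches an aligned chunk wherever it appears.

```python
import hashlib
from typing import List, Tuple

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def split_chunks(tokens: List[int], size: int = CHUNK_SIZE) -> List[Tuple[int, ...]]:
    """Split a token sequence into full fixed-size chunks (a ragged tail is dropped)."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def content_key(chunk: Tuple[int, ...]) -> str:
    """Key that depends only on the chunk's own tokens."""
    return hashlib.sha256(repr(chunk).encode()).hexdigest()


def prefix_key(prev_key: str, chunk: Tuple[int, ...]) -> str:
    """Key that also depends on the key of the preceding chunk (prefix-chained)."""
    return hashlib.sha256((prev_key + repr(chunk)).encode()).hexdigest()


def keys_for(tokens: List[int], chained: bool) -> List[str]:
    """Compute one cache key per chunk under the chosen scheme."""
    keys, prev = [], ""
    for chunk in split_chunks(tokens):
        prev = prefix_key(prev, chunk) if chained else content_key(chunk)
        keys.append(prev)
    return keys


def count_hits(cached_tokens: List[int], new_tokens: List[int], chained: bool) -> int:
    """Count chunks of new_tokens whose key is already present from cached_tokens."""
    cache = set(keys_for(cached_tokens, chained))
    return sum(k in cache for k in keys_for(new_tokens, chained))


# Two requests share the same 8-token document but start with different preambles.
shared_doc = list(range(100, 108))
req_a = [1, 2, 3, 4] + shared_doc
req_b = [9, 9, 9, 9] + shared_doc

prefix_hits = count_hits(req_a, req_b, chained=True)
content_hits = count_hits(req_a, req_b, chained=False)
print(prefix_hits, content_hits)
```

In this toy setup the chained scheme finds no reusable chunks because the first chunk differs, while the content-only scheme matches both document chunks. The sketch covers only the lookup: a real system that reuses a chunk's KV entries under a different preceding context has to correct for the attention context they were originally computed with, which is presumably what the recompute strategies mentioned above are for.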


snovv_crash 30 days ago

Even the things that should be normal dashes are em-dashes


mrob 30 days ago

En-dashes are not em-dashes, and they're standard typography for numeric ranges.

https://en.wikipedia.org/wiki/Dash#Ranges_of_values


arjie 30 days ago

I don't get it. The output of the CacheBlend paper is in LMCache. Did you compare against vLLM with LMCache? This is confusing.


pferdone 30 days ago

slop
