HN Mirror

	by magicalhippo 30 days ago
	The KV cache size is a wildcard, though. Different models use very different amounts of memory per token in the KV cache, so the VRAM required for, say, 64k context can vary by almost an order of magnitude. The download size might tell you whether you have room for the model weights, but how much context fits in the leftover VRAM budget is harder to predict at a glance. That some models, like Qwen3.6 27B, seem barely affected by a Q8-quantized KV cache while others degrade heavily doesn't make it easier.
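
	To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a plain transformer with full attention in every layer, where each token stores one key and one value vector per layer per KV head. The two configs are hypothetical, not taken from any real model, and Q8 is approximated as 1 byte per element, ignoring per-block scale overhead. Models with sliding-window or other non-standard attention will not follow this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Estimate KV cache size, assuming full attention in every layer."""
    # Factor 2: one key vector and one value vector per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


GIB = 1024 ** 3
CONTEXT = 64 * 1024  # 64k tokens

# Hypothetical configs for illustration only (not real model cards).
configs = {
    "many KV heads (MHA-style)": dict(n_layers=32, n_kv_heads=32, head_dim=128),
    "few KV heads (GQA-style)": dict(n_layers=32, n_kv_heads=4, head_dim=128),
}

for name, cfg in configs.items():
    f16 = kv_cache_bytes(context_len=CONTEXT, **cfg) / GIB
    # Assumption: Q8 costs about 1 byte per element (scale overhead ignored).
    q8 = kv_cache_bytes(context_len=CONTEXT, bytes_per_elem=1.0, **cfg) / GIB
    print(f"{name}: {f16:.1f} GiB at f16, ~{q8:.1f} GiB at Q8")
```

	With these made-up numbers the two configs differ by 8x at the same context length, even though layer count and head size are identical. The real values for a given model come from its config (layer count, KV head count, head dimension), not from the download size.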