	by naasking 83 days ago
	This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware. [1] https://github.com/z-lab/paroquant
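
To put rough numbers on "possibly runnable", here is a back-of-envelope memory sketch. The model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128, 128k context) is a hypothetical example and not taken from either project. It also assumes values are stored at the same 3 bits as keys, and it ignores quantization metadata (scales, rotation parameters) and activation memory, so real footprints would be somewhat higher.

```python
# Back-of-envelope memory estimate for quantized weights and KV cache.
# All model dimensions below are hypothetical, chosen only for illustration.

GIB = 1024 ** 3


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: float) -> float:
    """Bytes for the KV cache of one sequence.

    The factor of 2 covers keys and values; this sketch assumes both
    are stored at the same bit width.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


# Hypothetical 70B-class model with grouped-query attention.
N_PARAMS = 70_000_000_000
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 128_000

for label, w_bits, kv_bits in [("bf16 baseline", 16, 16),
                               ("4-bit weights, 3-bit KV", 4, 3)]:
    w = weight_bytes(N_PARAMS, w_bits)
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, kv_bits)
    print(f"{label}: weights {w / GIB:.1f} GiB, "
          f"KV cache {kv / GIB:.1f} GiB, total {(w + kv) / GIB:.1f} GiB")
```

Under these assumptions the weights shrink by 4x and the KV cache by a bit over 5x, which is the difference between needing a multi-GPU server and fitting in the memory of a high-end workstation.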