| HN Mirror



	by suprjami 82 days ago
	Some models really suffer badly from KV quantisation. You can also take a speed hit from using dissimilar K and V types. TurboQuant seems to be the next big thing in context memory usage. It uses polar coordinates to achieve a ~5x reduction in memory usage with minimal or no quality loss, and even a slight speedup in some cases.
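	For a rough sense of what those ratios mean in practice, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dim 128) and the 32k context are made-up illustrative numbers, not any specific model, and the 8-bit figure ignores per-block scale overhead. The ~5x line just divides the fp16 size by 5 and says nothing about how TurboQuant actually stores things.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Approximate KV cache size: one K and one V vector per layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32768)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)  # 16-bit K and V
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)  # idealised 8-bit, no scale overhead
approx_5x_bytes = fp16_bytes / 5                        # the "~5x reduction" claim, taken at face value

GIB = 1024 ** 3
for label, size in [("fp16", fp16_bytes), ("8-bit", int8_bytes), ("~5x", approx_5x_bytes)]:
    print(f"{label:>6}: {size / GIB:.2f} GiB")
```

	The cache grows linearly with context length, so the gap between these rows widens as the context gets longer.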

1 comments

LuxBennu 82 days ago

yeah fair point, it's definitely model-dependent. i've had good results with qwen, but i tried it on a smaller mistral variant once and the output quality dropped noticeably even with q8 for both K and V. the speed hit from mixed types hasn't been bad on apple silicon in my experience, but i can see it mattering more on cuda.
