Hacker News new | ask | show | jobs
by bdcs 86 days ago
Here's my attempt at a undergrad-level summary (corrections welcome!):

The core idea is to quantize KV cache, but do so in a way that destroys minimal information. In this case, it's similarly scores between vectors. The simplest way to do this is to change all the elements from 16bit of precision to, say, 4 bits (Scalar Quant.). These papers improve on it by realizing: almost all the energy (concentration of measure) is towards the equator of the hypersphere (normally distributed as 1/d; d=vector dimensionality). (The curse/blessing of hyper dimensionality strikes again.) So when we quantize the elements (think "latitudes", e.g. to the nearest degree) we destroy a lot of information because basically all the vectors were around the equator (so some latitudes have a lot of vectors and some have very few). The idea is to rotate the vectors away from the equator so they're more consistently distributed (to better preserve the entropy during quantization, which I guess was amitport's DRIVE idea). PolarQuant does a hyperpolar coordinate transform which superficially seems neat for preserving entropy because of this equator/polar framing (and ultimately unnecessary as shown by TurboQuant). They also realized there's a bias to the resulting vectors during similarity, so they wrote the QJL paper to fix the bias. And then the TurboQuant paper took PolarQuant + QJL, removed the hyperpolar coords, and added in some gross / highly-pragmatic extra bits for important channels (c.f. elements of the vectors) which is sort of a pathology of LLMs these days but it is what it is. Et voila, highly compressed KV Cache. If you're curious why you can randomly rotate the input, it's because all the vectors are rotated the same, so similarity works out. You could always un-rotate to get the original, but there's no need because the similarity on rotated/unrotated is the same if you compare apples to apples (with the QJL debiasing). Why was PolarQuant even published? Insu Han is solely on that paper and demanded/deserved credit/promotion, would be my guess. The blog post is chock-full of errors and confusions.

2 comments

Some corrections: the vectors are un-rotated in practice for future query vectors. This could be removed with a slightly different LLM arch.

PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

> added in some gross / highly-pragmatic extra bits for important channels

I'm curious what you meant by that. I understood it to only have the MSE quantization vector, a 1-bit QJL vector, and a scalar magnitude.

> PolarQuant does live on in TurboQuant's codebooks for quantization which borrows from the hyperpolar coords

Isn't the turbo codebook the irregularly spaced centroid grid?

> extra bits per channel

Page 18 of the paper: > As shown in Table 1, our approach outperforms other methods for both Llama-3.1-8B-Instruct and Ministral-7B-Instruct, achieving significantly higher average scores. We evaluate our method using 2.5-bit and 3.5-bit quantization during text generation. These non-integer bit precisions result from our strategy of splitting channels into outlier and non-outlier sets, and applying two independent instances of TurboQuant to each, allocating higher bit precision to outliers. This outlier treatment strategy is consistent with prior work [63, 51] . For example, in our 2.5-bit setup, 32 outlier channels are quantized at 3 bits, while the remaining 96 channels use 2 bits, leading to an effective bit precision of (32 ×3 + 96×2)/128 = 2.5. For 3.5-bit quantization, a different ratio of outliers and regular channels leads to a higher effective bit precision. Despite using fewer bits than competing techniques, TurboQuant maintains performance comparable to unquantized models

So they find channels / indicies-of-the-vector that are important and give them more bits (3 bits) than the rest (2 bits).

>Isn't the turbo codebook the irregularly spaced centroid grid?

yes I believe so. They mention it's informed by the concentration of measure and the uncorrelated/independent vectors after the initial conditioning rotation. I feel like it was informed by PolarQuant, but that may just be how I intuit what's going on (because thinking about this in polar coordinates makes more sense in my head). IOW, I think the irregular spacing is maybe informed by TurboQuant.

However they do say, slightly to the contrary: "We find optimal scalar quantizers for random variables with Beta distributions by solving a continuous 1-dimensional k-means problem using the Max-Lloyd algorithm."

Beautiful explanation, thanks!