Hacker News new | ask | show | jobs
by tveita 606 days ago
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smear" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf

6 comments

> But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

Funny enough, if you visualize a vector-embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use — and you assume a really low quantization, say, 1-bit — then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions to reduce the storage costs down to storing a raster bitmap.

And that problem has a name: dithering!

Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.

And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!

Another way to understand dithering is by smearing the frequency spectrum of the original image preventing extreme frequency values to distort the image after quantization - this can be done by applying kernel filters on the original image.

Which I think is what is happening with SpinQuant as well - a smoothing of the frequency spectrum of the model weights, confirmed by the smearing of the singular values of the weight matrices.

Fascinating! Does that mean you could improve performance further with Floyd–Steinberg dithering? (I.e. instead of rotating randomly, you track accumulated quantization error and add that amount instead.)
Floyd-Steinberg etc mostly look better to the human eye, but I'm not sure in what more 'objective' sense they would be better than random dithering?
Floyd-Steinberg is one sort of quasi-random algorithm, but there are others. People often use quasi-random rather than true randomness when they want to avoid sample points bunching together. They tend to be more evenly distributed. That can get more important in higher-dimension space where it's easy to completely miss sampling large volumes because a truly random point set has too many degrees of freedom.
Interesting.

What you are describing reminds me of Low discrepancy sequences: https://en.wikipedia.org/wiki/Low-discrepancy_sequence

Though these methods have their problems and blind-spots, too, and are often outdone by random sampling with even slightly higher sample count, while preserving all the simplicity and (statistical) guarantees you get from randomness.

But images have regular adjacent pixels to work with. Don't think the algo can be straight applied to irregularly placed points in manydimensional space.
The best type of dithering is done with error diffusion. There's a convolutional kernel the diffuses the error over multiple adjacent data points.
I'm just on the edge of understanding this but if I'm visualizing this right you're talking about a point source at the center of a sphere and a bitmap indicating where all the vectors intersect the surface. But that would mean the lengths would all be the same.

Isn't it the lengths/distances to neighbors that is the main information being stored in a vector db? Or is it just that what you're talking about only concerns the angles so the lengths are not part of the discussion?

I'm a dev but still have a lot to learn about ML :)

My understanding is that yes, it actually is normalized to have the lengths all be the same, and thus the angle from (hyperdimensional) 0,0,0,(...n) is all that matters. The "distance between two embeddings" is able to simply to cosign of the two angles.
Seems really intriguing could you help me grok how this random perturbations of the points of the hypersphere surface are related to smearing the model weights?
I'm sorry, I don't understand the language you're speaking. English please?

(Just kidding - but if you have any recommendations for learning resources to get started being able to understand what you're talking about, I'd greatly appreciate it.)

One of the fun things about signals theory is how the same basic concept will show up in apparently unrelated places.

Example from electrical engineering: microprocessors will have a "clock" frequency, say, 16Mhz. But when you haul a wire up to VCC and pull it back down to ground, some amount of the power will be radiated away as radio waves. If your clock is at a constant rate, then you'll have a big spike of radiated noise at 16MHz, and the FCC will be unhappy.

So modern devices cheat it by dithering around the central frequency. If you bounce from 15.9998MHz to 16.001 to 15.998 then the same amount of power will be radiated, but smeared across a bigger frequency, enough to get you lower than the regulatory threshold. Spread spectrum clock generation. https://www.analog.com/en/resources/technical-articles/clock...

If you look in your PC's BIOS settings, spread spectrum is usually an option, and you can disable it if you want your computer to be slightly noisier.

in 2024 you paste the dense comment into your favorite LLM (preferably multiple) and ask it to explain on your desired level (whatever that may be). works remarkably well for every topic I tried it with (e.g. jargon-heavy financial tweets.)
Interestingly, FAISS does exactly that before doing Product Quantization and it works very well (errors are much lower compared to no rotation). They call it “optimal PQ”. During training time, they iterate to find a good candidate and save the best one.

Perhaps not entirely coincidentally, FAISS is also maintained by FB.

https://faiss.ai/cpp_api/struct/structfaiss_1_1OPQMatrix.htm...

I find the geometrical intuition of rotating a vector in high dimensional space to minimize its largest values (vector basis projections) beautiful.

I'm no expert and I'm sure this has been tried by many people already - but would it be possible to reduce the computational effort instead by using SVD decomposition, spreading the singular values and then reapplying the original singular values and recomposing the matrix using the quantized versions of the SVD matrices?

Tangentially related to the idea of "apply a random rotation matrix" is one where you apply a random matrix to a set of points to preserve distances between them but transform them into a lower dimensional space. This is captured by the JL Lemma [1].

[1] - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...

Actually, “apply a random matrix” is often the solution to a large dimensional space involving near neighbours.

The Johnson-Lindenstrauss lemma asserts that a multiplying by a random matrix (some conditions apply, but iirc rotation matrices satisfy them) keeps, in many senses, the distances between points even if the dimension drops very significantly (some conditions apply but usually satisfied by real world data)

This is, in fact, the theoretical underpinning of compressed sensing.

You might like an information-theoretic take on SpinQuant and the likes [1].

tl;dr: round((2*R)*x) is not a great idea for an R-bit quantization.

[1] https://arxiv.org/abs/2410.13780