| So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outliers out so that no single weight or activation ends up with an extreme value. Random anecdote warning: in the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search over a decent number of high-dimensional vectors. I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates.
Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great. Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only. As it turns out, the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work when "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem. [1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf |
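For anyone curious what the "random rotation, then quantize to bits, then scan" idea might look like, here is a rough sketch in NumPy. It is not the original code and not the ITQ algorithm from the paper (which learns the rotation iteratively); it only covers the random-rotation baseline. The function names (`random_rotation`, `to_bits`, `hamming_candidates`), the sign-of-each-coordinate binarization, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def random_rotation(dim, rng):
    # QR-decompose a Gaussian matrix to get an orthogonal matrix; flipping
    # column signs by sign(diag(R)) keeps it orthogonal and makes the draw uniform.
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))

def to_bits(vectors, rotation):
    # Rotate each row, keep only the sign of every coordinate,
    # and pack 8 sign bits into each byte.
    return np.packbits((vectors @ rotation.T) > 0, axis=1)

def hamming_candidates(query_bits, index_bits, k):
    # Linear scan over the packed index: XOR, count differing bits,
    # return the k rows with the smallest Hamming distance.
    diff = np.bitwise_xor(index_bits, query_bits)
    dist = np.unpackbits(diff, axis=1).sum(axis=1)
    return np.argsort(dist, kind="stable")[:k]

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 128))      # synthetic stand-in for real vectors
rotation = random_rotation(128, rng)
index = to_bits(data, rotation)              # 1000 x 16 bytes

query = data[42] + 0.05 * rng.standard_normal(128)
candidates = hamming_candidates(to_bits(query[None, :], rotation), index, 10)

# Re-rank the small candidate set with exact distances.
exact = np.linalg.norm(data[candidates] - query, axis=1)
best = candidates[np.argmin(exact)]
```

The bit index is only a cheap first pass; the exact re-ranking step at the end is what recovers the selectivity that the bit vectors alone lack.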
Funnily enough, if you visualize a vector embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use, and you assume a really coarse quantization (say, 1-bit), then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions so that all you have to store is a raster bitmap.
And that problem has a name: dithering!
Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.
And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!
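A quick way to sanity-check the "rotations about the origin" intuition: a rotation leaves lengths and dot products between vectors untouched (so the relationships between points survive), while it redistributes the value of any single coordinate across all of them. The sketch below is only an illustration with a random orthogonal matrix and made-up vectors; it is neither SpinQuant (which learns its rotation) nor an actual dithering algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Random orthogonal matrix via QR of a Gaussian matrix (illustrative choice).
q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
rotation = q * np.sign(np.diag(r))

a = rng.standard_normal(dim)
a[3] = 50.0                      # hypothetical outlier coordinate
b = rng.standard_normal(dim)

ra, rb = rotation @ a, rotation @ b

# Geometry between the two vectors is preserved by the rotation...
dots_match = np.allclose(a @ b, ra @ rb)
norms_match = np.allclose(np.linalg.norm(a), np.linalg.norm(ra))

# ...but the largest single coordinate shrinks, because the outlier's
# magnitude is now spread over many coordinates.
peak_before = np.abs(a).max()
peak_after = np.abs(ra).max()
print(dots_match, norms_match, peak_before, peak_after)
```

A smaller peak coordinate means a fixed number of quantization levels wastes less of its range on one extreme value, which is the same motivation given for rotating before quantizing.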