Hacker News new | ask | show | jobs
by derefr 606 days ago
> But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

Funny enough, if you visualize a vector-embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use — and you assume a really low quantization, say, 1-bit — then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions to reduce the storage costs down to storing a raster bitmap.

And that problem has a name: dithering!

Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.

And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!

6 comments

Another way to understand dithering is by smearing the frequency spectrum of the original image preventing extreme frequency values to distort the image after quantization - this can be done by applying kernel filters on the original image.

Which I think is what is happening with SpinQuant as well - a smoothing of the frequency spectrum of the model weights, confirmed by the smearing of the singular values of the weight matrices.

Fascinating! Does that mean you could improve performance further with Floyd–Steinberg dithering? (I.e. instead of rotating randomly, you track accumulated quantization error and add that amount instead.)
Floyd-Steinberg etc mostly look better to the human eye, but I'm not sure in what more 'objective' sense they would be better than random dithering?
Floyd-Steinberg is one sort of quasi-random algorithm, but there are others. People often use quasi-random rather than true randomness when they want to avoid sample points bunching together. They tend to be more evenly distributed. That can get more important in higher-dimension space where it's easy to completely miss sampling large volumes because a truly random point set has too many degrees of freedom.
Interesting.

What you are describing reminds me of Low discrepancy sequences: https://en.wikipedia.org/wiki/Low-discrepancy_sequence

Though these methods have their problems and blind-spots, too, and are often outdone by random sampling with even slightly higher sample count, while preserving all the simplicity and (statistical) guarantees you get from randomness.

But images have regular adjacent pixels to work with. Don't think the algo can be straight applied to irregularly placed points in manydimensional space.
The best type of dithering is done with error diffusion. There's a convolutional kernel the diffuses the error over multiple adjacent data points.
I'm just on the edge of understanding this but if I'm visualizing this right you're talking about a point source at the center of a sphere and a bitmap indicating where all the vectors intersect the surface. But that would mean the lengths would all be the same.

Isn't it the lengths/distances to neighbors that is the main information being stored in a vector db? Or is it just that what you're talking about only concerns the angles so the lengths are not part of the discussion?

I'm a dev but still have a lot to learn about ML :)

My understanding is that yes, it actually is normalized to have the lengths all be the same, and thus the angle from (hyperdimensional) 0,0,0,(...n) is all that matters. The "distance between two embeddings" is able to simply to cosign of the two angles.
Seems really intriguing could you help me grok how this random perturbations of the points of the hypersphere surface are related to smearing the model weights?
I'm sorry, I don't understand the language you're speaking. English please?

(Just kidding - but if you have any recommendations for learning resources to get started being able to understand what you're talking about, I'd greatly appreciate it.)

One of the fun things about signals theory is how the same basic concept will show up in apparently unrelated places.

Example from electrical engineering: microprocessors will have a "clock" frequency, say, 16Mhz. But when you haul a wire up to VCC and pull it back down to ground, some amount of the power will be radiated away as radio waves. If your clock is at a constant rate, then you'll have a big spike of radiated noise at 16MHz, and the FCC will be unhappy.

So modern devices cheat it by dithering around the central frequency. If you bounce from 15.9998MHz to 16.001 to 15.998 then the same amount of power will be radiated, but smeared across a bigger frequency, enough to get you lower than the regulatory threshold. Spread spectrum clock generation. https://www.analog.com/en/resources/technical-articles/clock...

If you look in your PC's BIOS settings, spread spectrum is usually an option, and you can disable it if you want your computer to be slightly noisier.

in 2024 you paste the dense comment into your favorite LLM (preferably multiple) and ask it to explain on your desired level (whatever that may be). works remarkably well for every topic I tried it with (e.g. jargon-heavy financial tweets.)