| HN Mirror

by gavinray 84 days ago

Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"

I don't understand how taking a series of data and applying a random rotation could mathematically lead every time to "simpler" geometry.

If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right?

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."

How can a boolean value preserve all of the relational and positional information between data points?

7 comments

kingstnap 84 days ago

Other people have answered here but the real answer is that deep neural networks don't learn isotropic distributions of activations.

What happens is that you get very spiky activations, so-called "outlier" activations. An easy-to-read paper on this is SmoothQuant [0]. Another source, from Anthropic and the mechanistic interpretability people, calls this a "privileged basis" [1].

Now, based on the weight symmetries of a typical transformer, these don't actually need to exist. Weight symmetries are the ways you can change the weights without affecting the mathematical function; there is a broad class of these because the linear algebra has a lot of redundancy in it.

But the behaviour of the Adam optimizer is such that you do end up with these things, because it tends to reach them more quickly. This comes from the fact that it uses an elementwise dynamic learning rate (and probably partly from the epsilon).

[0] https://arxiv.org/pdf/2211.10438

[1] https://transformer-circuits.pub/2023/privileged-basis/index...
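
To make the "weight symmetry" point concrete, here is a minimal NumPy sketch. It assumes two stacked linear maps with nothing elementwise between them (a simplification; the names `W1`, `W2`, `R`, and `random_orthogonal` are made up for illustration). Folding a random orthogonal matrix into one weight and its transpose into the other changes the intermediate activations but not the output.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

d = 8
W1 = rng.standard_normal((d, d))   # hypothetical "write" weights
W2 = rng.standard_normal((d, d))   # hypothetical "read" weights
x = rng.standard_normal(d)
R = random_orthogonal(d, rng)

y_original = W2 @ (W1 @ x)

# Rotate the hidden basis: write through R, read through R^T.
W1_rot = R @ W1
W2_rot = W2 @ R.T
y_rotated = W2_rot @ (W1_rot @ x)

# Same function, different hidden activations.
max_diff = float(np.max(np.abs(y_original - y_rotated)))
hidden_changed = not np.allclose(W1 @ x, W1_rot @ x)
print(max_diff, hidden_changed)
```

The identity only relies on `R.T @ R` being the identity, so it stops holding as soon as an elementwise operation sits between the two matrices.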

gavinray 84 days ago

From your second paper:

  > In particular, we can generate fixed random rotation matrices at initialization, and multiply them into the activations any time we read from or write to the residual stream.

I guess I was mistaken in assuming this part was part of the TurboQuant-specific innovations. Still an interesting concept though

Bolwin 84 days ago

Do you know if this also applies to the Muon optimizer? It seems to be replacing AdamW.

kingstnap 84 days ago

My guess is probably not for Muon. What I said about Adam was partly based on this blog post I read some time ago; I should have cited it as well [0].

The thing about Muon is that it doesn't have this specific feature of Adam that causes it to "move along the diagonal". Basically, if you flatten the weights into a huge vector of a few billion elements, SGD moves along the gradient, which isn't biased. Adam normalizes everything elementwise, so it sort of moves along a vector of +-1.

This isn't a proof or anything, but what you can imagine might be happening is that if you move along +-1, then you find spiky solutions somehow. Not sure how to prove that. Muon doesn't really do this, but it has its own sort of funky reshaping of the update (it moves along low-rank directions).

[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...
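
A small sketch of the "+-1" point, assuming the textbook Adam update with its usual default hyperparameters (the `adam_step` helper and the gradient values are made up for illustration). On the very first step the bias-corrected moments reduce to `g` and `g**2`, so every coordinate moves by roughly `lr * sign(g)` regardless of how large its gradient is, while SGD keeps the gradient's scale.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: running first and second moments with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

# Gradient coordinates spanning many orders of magnitude.
grad = np.array([1e-4, -3.0, 0.02, 250.0])
zeros = np.zeros_like(grad)

adam_update, _, _ = adam_step(grad, zeros, zeros, t=1)
sgd_update = -1e-3 * grad

print(adam_update)  # every entry has magnitude close to lr
print(sgd_update)   # entries keep the gradient's relative scale
```

This only shows the first step; later steps depend on the gradient history, so treat it as intuition rather than a claim about full training runs.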

lumost 84 days ago

They are saying that models should be invariant to the data's orientation, and only sensitive to the distance between vectors. This has a pretty significant effect on reducing the set of possible models, and may stabilize the optimization.

In simple terms, large ML models like LLMs often learn trivial rules such as "if the 21st decimal place of the 5th dimension in the embedding vector is 5 - then the image is of a cat." Learning such a memorization function is usually not what we are trying to do, and there are a variety of techniques to avoid these trivial solutions and "smooth" the optimization geometry.

photon_lines 84 days ago

The whole goal of quantisation is to put the data into 'bins' so that it can be 'packed' and represented using fewer bits (less information). You can think of it as rounding, essentially (3.14159 -> 3).

Now, sometimes the distribution of the data is non-ideal for separating it into bins. Let's say our rounding rule is simple: we use a floor function, so 2.45 maps to 2, 6.4543 maps to 6, and so on. A set of numbers like [3.11, 4.43, 5.78, 12.33, 34.32] would then simply map to [3, 4, 5, 12, 34]. To create bins for that set we would need 6 bits of information (2 to the power of 6 = 64), but this is mostly due to the fact that we have one huge outlier (34.32).

To get rid of this, the algorithm applies a random rotation matrix which 'distorts' the original data so that it is more evenly distributed among the possible bins. In linear algebra, a rotation matrix is an orthogonal matrix. When you multiply your vector by this matrix, you aren't changing the "amount" of data (the length of the vector remains the same), but you are recalculating every single number in that vector as a weighted sum of the originals. According to the Central Limit Theorem, when you sum up many random things, the result starts looking like a bell curve. This is the magic TurboQuant relies on: they don't know what your data looks like, but they know that after the rotation, each coordinate must follow a Beta distribution (which is close to a bell curve in high dimensions), and they use this fact to transform the original data into a more 'tightly packed' distribution which allows them to pack (or quantise) the information more efficiently.

If most of the transformed data is huddled together into a predictable bell-curve shape, you can pack your bins tightly around that shape, leading to much higher precision with fewer bits. For example, after applying a rotation matrix, our original [3.11, 4.43, 5.78, 12.33, 34.32] might get mapped to something like [8.12, 8.65, 9.25, 10.53, 12.86], and we can create bins which are both more accurate and need fewer bits to hold our original data set.

To create the most optimal bins, the Lloyd-Max algorithm is used. This algorithm is the gold standard for 1D quantisation. Its goal is to find the best places to put your "boundaries" (where you cut the data) and your "reconstruction values" (the number you store) to minimise the Mean Squared Error (MSE).

After applying this, you have your 'rounded' values (or quantised data), but there is still an error value which is missing from our data set, and this is where the residual bit comes in. That bit doesn't represent the original data (or vector); it simply represents our 'bias' after we apply the above algorithms. It's basically like a '1-bit note' which allows you to cancel out, on average, the bias terms which the above quantisation algorithm produces, so that the 'interactions' (or inner products) when we multiply our values together are accurate again even after transforming our original data. Does this make sense?
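
The rotate-then-quantise idea above can be sketched in NumPy. This is an illustration under stated assumptions, not TurboQuant's actual implementation: the 2-bit codebook is fitted once to standard Gaussian samples (a stand-in for the Beta-distributed coordinates), the test vector has a single planted outlier, and every name here (`random_orthogonal`, `lloyd_max`, `quantize`, `rel_err`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(samples, n_levels, n_iter=50):
    # 1D Lloyd-Max: alternate midpoint boundaries and centroid levels.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        boundaries = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(boundaries, samples)
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def quantize(values, levels):
    # Map each value to the reconstruction level of the bin it falls in.
    boundaries = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(boundaries, values)]

def rel_err(a, b):
    return float(np.sum((a - b) ** 2) / np.sum(a ** 2))

# Codebook fitted once, with no knowledge of the data (4 levels = 2 bits).
levels = lloyd_max(rng.standard_normal(100_000), n_levels=4)

d = 256
x = 0.1 * rng.standard_normal(d)
x[7] = 34.32                             # one huge outlier coordinate
scale = np.linalg.norm(x) / np.sqrt(d)   # normalise to unit variance per coordinate

R = random_orthogonal(d, rng)
y = R @ x                                # same length, outlier spread over all coordinates

x_hat_plain = scale * quantize(x / scale, levels)
x_hat_rot = R.T @ (scale * quantize(y / scale, levels))

err_plain = rel_err(x, x_hat_plain)
err_rot = rel_err(x, x_hat_rot)
print(np.abs(x).max(), np.abs(y).max(), err_plain, err_rot)
```

With the same fixed 2-bit codebook, the rotated vector should reconstruct with a much smaller relative error, because its coordinates match the distribution the codebook was designed for.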

nico 84 days ago

Amazing explanation! Thank you so much for taking the time to put it together. It makes a lot of sense. I’m not the one who asked the question, but I was impressed by such an eloquent and clearly explained answer.

photon_lines 83 days ago

Thank you! I'm glad you found it helpful (and that others did too)!!

thrtythreeforty 82 days ago

This is a fantastic explanation. Thank you. The only part I am not following is how it is guaranteed that 1 bit is sufficient for the error value. Is this something the Lloyd-Max algorithm is responsible for ensuring? (Seems to me that if your quantization algorithm is crappy enough, you could need a large number of bits to store the error.)

rtrgrd 84 days ago

Added to my non-llm username list :)

Thanks so much for the explanation

psidium 81 days ago

Wow, thank you for the explanation. Such a complex topic and yet you’ve made it simple to understand.

functional_dev 81 days ago

I wonder what the limit of quantization is before it starts to destroy the logic of the weights?

gavinray 84 days ago

I had to read this over a few times to piece it together, thanks for the thorough and digestible explanation!

rohansood15 84 days ago

Thank you.

gopalv 84 days ago

> I don't understand how taking a series of data and applying a random rotation could mathematically lead every time to "simpler" geometry.

Let's pick a simpler compression problem where changing the frame of reference improves packing.

There's a neat trick in the context of floating point numbers.

The values do not always compress when they are stored exactly as given.

[0.1, 0.2, 0.3, 0.4, 0.5]

Maybe I can encode them in 15 bytes instead of 20 as float32.

Up the frame of reference to be decibels instead of bels and we can encode them as sequential values without storing exponent or sign again.

Changing the frame of reference makes the numbers "more alike" than they were originally.

But how you pick a good frame of reference is all heuristics and optimization gradients.
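
A tiny sketch of the change-of-units trick, assuming a scale factor of 10 and one byte per value (both are illustrative choices and not the 15-byte encoding mentioned above):

```python
import numpy as np

values = np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float32)
raw_bytes = values.tobytes()            # 4 bytes per float32 value

# Change the frame of reference: tenths become whole units.
scale = 10                              # assumed scale factor (bels -> decibels)
codes = np.round(values * scale).astype(np.uint8)
packed_bytes = codes.tobytes()          # 1 byte per value

# Undo the change of units when reading the data back.
restored = codes.astype(np.float32) / scale

print(len(raw_bytes), len(packed_bytes), codes.tolist())
```

In the new units the values are consecutive small integers, so no sign or exponent needs to be stored per value.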

redanddead 82 days ago

AI and graphics are matrices

Matrices are numbers [x,y,z]

GPUs are matrix processing units

Models are big matrices, we quantize them to make them small. That is lossy. Makes AI dumber the harder you quantize but lets you run inference with lesser hardware

What if you could quantize less destructively/lossy? You could make a model way smaller or make much bigger models that run on less RAM

That is what they achieved here. They're not saying that multiplying the matrices with scalars up or down helps. They're saying that by mutating and transforming the matrix with a function (i.e. rotating the dimensions by the same "random" rotation) you have matrices that make smarter models fit in smaller boxes, needing way less RAM to achieve the same performance

If we quantized it as aggressively as we would have without the distribution/mutation function, the drop in benchmarks would be even more noticeable

It's actually a huge breakthrough and commercially it's probably only a short term loss in valuation for the manufacturers

wordpad 84 days ago

They are not doing random rotation; simplification here means they are aligning the outliers. If you threw a bunch of shapes on the ground, they are picking up the one that rolled away and putting it with the others.

>How can a boolean value preserve all of the relational and positional information between data points?

They aren't reducing the entire vector to a boolean, only each of its dimensions.
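
To illustrate how a collection of sign bits can still carry inner-product information, here is a sketch of a sign random projection. The assumptions are mine: a Gaussian projection matrix `S`, a query `q` kept in full precision, the norm of `k` stored alongside its bits, and a deliberately large number of projections `m` so the estimate is tight. It relies on the identity E[(g.q) sign(g.k)] = sqrt(2/pi) * (q.k) / |k| for a standard Gaussian vector g.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000                         # vector dimension, number of sign bits

k = rng.standard_normal(d)                # vector to be compressed
q = k + 0.5 * rng.standard_normal(d)      # query correlated with k

S = rng.standard_normal((m, d))           # shared random projection
k_bits = np.sign(S @ k)                   # one sign bit (+1 or -1) per projection
k_norm = np.linalg.norm(k)                # a single extra scalar kept with the bits

# Rescale the correlation between the projected query and the stored bits.
estimate = float(np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, k_bits))
exact = float(np.dot(q, k))

print(exact, estimate)
```

Each individual bit says very little; the estimate comes from averaging over many of them, and its error shrinks as `m` grows.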

elif 84 days ago

i could be mistaken but from my read, the 'rotation' aspect is nothing new and not dissimilar from normal spin quant, where the importance matrix is rotated during calibration such that the local minima/maxima are more evenly smoothed and excessive/redundant quantization of parameters is avoided.

as for the J-L transformation, it is way above my head, so i'm almost certainly mistaken, but it seems to be some clever way to use a bit as a sort of pointer in order to reuse existing chunks of parameter weight data, like in a jpeg or zip compression algorithm.