| HN Mirror


	by cproctor 691 days ago
	One thing I've wondered for a while: Is there a principled reason (e.g. explainable in terms of embedding training) why a vector's magnitude can be ignored within a pretrained embedding, such that cosine similarity is a good measure of semantic distance? Or is it just a computationally-inexpensive trick that works well in practice? For example, if I have a set of words and I want to consider their relative location on an axis between two anchor words (e.g. "good" and "evil"), it makes sense to me to project all the words onto the vector from "good" to "evil." Would comparing each word's "good" and "evil" cosine similarity be equivalent, or even preferable? (I know there are questions about the interpretability of this kind of geometry.)
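
For unit-length vectors the two approaches in the question can be compared directly. The sketch below uses hand-picked 3-D toy vectors (not real embeddings; `good`, `evil`, the word list and the helper names are all illustrative assumptions). Once everything is normalized, cos(w, evil) - cos(w, good) equals w·(evil - good), so it should order the words the same way as projecting onto the axis from "good" to "evil".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def axis_position(w, start, end):
    """Scalar position of w along the axis start -> end (0 at start, 1 at end)."""
    d = [e - s for s, e in zip(start, end)]
    rel = [x - s for x, s in zip(w, start)]
    return dot(rel, d) / dot(d, d)

# Toy vectors, chosen by hand purely for illustration.
good = normalize([1.0, 0.2, 0.1])
evil = normalize([-0.9, 0.3, 0.2])
words = {
    "kind": normalize([0.8, 0.5, 0.1]),
    "neutral": normalize([0.0, 1.0, 0.3]),
    "cruel": normalize([-0.7, 0.1, 0.6]),
}

# Order the words from the "good" end to the "evil" end, two ways.
by_projection = sorted(words, key=lambda k: axis_position(words[k], good, evil))
by_cosine_gap = sorted(
    words, key=lambda k: cosine(words[k], evil) - cosine(words[k], good)
)
print(by_projection, by_cosine_gap)
```

If the vectors are not normalized first, the two orderings are no longer guaranteed to agree, since the projection then also depends on each word's magnitude.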

4 comments

Scene_Cast2 690 days ago

Some embedding models are explicitly trained on cosine similarity. Otherwise, if you have a 512D vector, discarding magnitude is like discarding just a single dimension (i.e. you get 511 independent dimensions).

extasia 690 days ago

This is not quite right; you are actually losing information about each of the dimensions and your mental model of reducing the dimensionality by one is misleading.

Consider [1,0] and [x,x] with x > 0. Normalised, we get [1,0] and [sqrt(.5),sqrt(.5)]. Clearly something has changed, because the first vector is now larger in dimension zero than the second, even though the second started off with an arbitrary value, x, which could have been larger or smaller than 1. As such we have lost information about x’s magnitude, which we cannot recover from just the normalized vector.
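
A minimal sketch of that example, picking x = 2 and x = 0.5 as arbitrary illustrative values:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

first = [1.0, 0.0]
second = [2.0, 2.0]          # [x, x] with x = 2

# Before normalizing, the second vector is larger in dimension zero.
print(second[0] > first[0])

# After normalizing, the comparison flips: 1.0 versus sqrt(.5).
print(normalize(first)[0] > normalize(second)[0])

# [2, 2] and [0.5, 0.5] collapse onto the same unit vector,
# so x cannot be recovered from the normalized result alone.
a = normalize([2.0, 2.0])
b = normalize([0.5, 0.5])
collapsed = all(math.isclose(p, q) for p, q in zip(a, b))
print(a, b, collapsed)
```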

Scene_Cast2 690 days ago

Well, depends. For some models (especially two tower style models that use a dot product), you're definitely right and it makes a huge difference. In my very limited experience with LLM embeddings, it doesn't seem to make a difference.
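
A toy illustration of why magnitude can matter when scoring with a raw dot product. The query and item vectors are made up for the example and do not come from any real two tower model:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
items = {
    "a": [0.5, 0.0],   # same direction as the query, small magnitude
    "b": [3.0, 3.0],   # 45 degrees off the query, large magnitude
}

# Rank items from best to worst under each scoring function.
rank_dot = sorted(items, key=lambda k: dot(query, items[k]), reverse=True)
rank_cos = sorted(items, key=lambda k: cosine(query, items[k]), reverse=True)
print(rank_dot, rank_cos)
```

The dot product prefers the large item "b", while cosine similarity prefers the perfectly aligned item "a".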

extasia 690 days ago

Interesting, I hadn’t heard of two tower models before!

Yes, I guess it’s curious that the information lost doesn’t seem very significant (this also matches my experience!)

Scene_Cast2 690 days ago

Two tower models (and various variants thereof) are popular for early stages of recommendation system pipelines and search engine pipelines.

atorodius 690 days ago

That's exactly the point, no? We lost 1 dim (magnitude). Not so nice in 2D but no biggie in 512D.

extasia 690 days ago

Magnitude is not a dimension; it’s information about each value that is lost when you normalize. To prove this, normalize any vector and then try to de-normalize it again.

kccqzy 690 days ago

Magnitude is a dimension. Any 2-dimensional vector can be explicitly transformed into the polar (r, theta) coordinate system where one of the dimensions is magnitude. Any 3-dimensional vector can be transformed into the spherical (r, theta, phi) coordinate where one of the dimensions is magnitude. This is high school mathematics. (Okay I concede that maybe the spherical coordinate system isn't exactly high school material, then just think about longitude, latitude, and distance from the center.)
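
A short sketch of the 2-D case (the helper names `to_polar` and `from_polar` are just for illustration): in polar coordinates, normalizing amounts to setting r to 1 and keeping theta.

```python
import math

def to_polar(x, y):
    """Cartesian (x, y) -> polar (r, theta); r is the magnitude."""
    return math.hypot(x, y), math.atan2(y, x)

def from_polar(r, theta):
    """Polar (r, theta) -> Cartesian (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(3.0, 4.0)

unit = from_polar(1.0, theta)   # the normalized vector: r replaced by 1
back = from_polar(r, theta)     # with r kept, the original vector comes back

print(r, unit, back)
```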

immibis 690 days ago

Impossible because... you lost a dimension.

extasia 690 days ago

That’s not mathematically accurate though, is it? We haven’t reduced the dimension of the vector by one.

Pray tell, which dimension do we lose when we normalize, say, a 2D vector?

ekianjo 690 days ago

You don't lose anything when you normalize things. Not sure what you are talking about.

renewiltord 690 days ago

There's something wrong with the picture here but I can't put my finger on it because my mathematical background here is too old. The space of k dimension vectors all normalized isn't a vector space itself. It's well-behaved in many ways but you lose the 0 vector (may not be relevant). Addition isn't defined anymore, and if you try to keep it inside by normalization post addition, distribution becomes weird. I have no idea what this transformation means for word2vec and friends.

But the intuitive notion is that if you take all 3D and flatten it / expand it to be just the surface of the 3D sphere, then paste yourself onto it Flatland style, it's not the same as if you were to Flatland yourself into the 2D plane. The obvious thing is that triangles won't sum to 180, but also parallel lines will intersect, and all sorts of differing strange things will happen.

I mean, it might still work in practice, but it's obviously different from some method of dimensionality reduction because you're changing the curvature of the space.
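
The curvature point can be checked numerically. As a sketch, take the triangle on the unit sphere whose corners are the three coordinate axes (an arbitrary but convenient choice) and measure each corner angle between the great-circle arcs that meet there:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tangent_toward(p, q):
    """Unit tangent at p pointing along the great circle from p toward q."""
    t = [qi - dot(p, q) * pi for pi, qi in zip(p, q)]
    n = math.sqrt(dot(t, t))
    return [x / n for x in t]

def corner_angle(p, q, r):
    """Angle in degrees at corner p of the spherical triangle p, q, r."""
    c = dot(tangent_toward(p, q), tangent_toward(p, r))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

A = [1.0, 0.0, 0.0]
B = [0.0, 1.0, 0.0]
C = [0.0, 0.0, 1.0]

angle_sum = corner_angle(A, B, C) + corner_angle(B, C, A) + corner_angle(C, A, B)
print(angle_sum)
```

Each corner of this particular triangle is a right angle, so the sum comes out well above 180 degrees.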

kccqzy 690 days ago

The space of all normalized k-dimensional vectors is just the unit (k-1)-sphere, i.e. the unit sphere sitting in k-dimensional space. You can deal with it directly, or you can use the standard stereographic projection to map every point (except for one) onto a (k-1)-dimensional plane.

> triangles won't sum to 180

Exactly. Spherical triangles have the sum of their interior angles exceed 180 degrees.

> parallel lines will intersect

Yes, because the "lines" on a sphere are really great circles, and any two distinct great circles intersect.
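
A minimal sketch of stereographic projection and its inverse, projecting from the "north pole" (0, ..., 0, 1). The function names are illustrative; the pole itself is the one point that has no image on the plane.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sphere_to_plane(x):
    """Unit vector in k dims (not the north pole) -> point in k-1 dims."""
    last = x[-1]
    return [xi / (1.0 - last) for xi in x[:-1]]

def plane_to_sphere(y):
    """Point in k-1 dims -> unit vector in k dims (inverse of the above)."""
    s = sum(yi * yi for yi in y)
    return [2.0 * yi / (s + 1.0) for yi in y] + [(s - 1.0) / (s + 1.0)]

point = normalize([1.0, 2.0, 2.0])      # a unit vector in 3-D
flat = sphere_to_plane(point)           # its image in the 2-D plane
restored = plane_to_sphere(flat)        # back on the sphere

print(point, flat, restored)
```

The round trip recovers the point, but distances on the plane are distorted relative to distances on the sphere, increasingly so near the projection pole.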

renewiltord 690 days ago

So is it actually the case that normalizing down and then mapping to the k-1 plane yields a useful (for this purpose) k-1 space? Something feels wrong about the whole thing but I must just have broken intuition.

nostrademons 690 days ago

So I first learned about cosine similarity in the context of traditional information retrieval, and the simplified models used in that field before the development of LLMs, TensorFlow, and large-scale machine learning might prove instructive.

Imagine you have a simple bag-of-words model of a document, where you just count the number of occurrences of each word in the document. Numerically, this is represented as a vector where each dimension is one token (so, you might have one number for the word "number", another for "cosine", another for "the", and so on), and the magnitude of that component is the count of the number of times it occurs. Intuitively, cosine similarity is a measure of how frequently the same word appears in both documents. Words that appear in both documents get multiplied together, but words that are only in one get multiplied by zero and drop out of the cosine sum. So because "cosine", "number", and "vector" appear frequently in my post, it will appear similar to other documents about math. Because "words" and "documents" appear frequently, it will appear similar to other documents about metalanguage or information retrieval.

And intuitively, the reason the magnitude doesn't matter is that those counts will be much higher in longer documents, but the length of the document doesn't say much about what the document is about. Taking the cosine (which divides the dot product by the product of the two vectors' magnitudes) is a form of length normalization, so that you can get sensible results without biasing toward shorter or longer documents.

Most machine-learned embeddings are similar. The components of the vector are features that your ML model has determined are important. If the product of the same dimension of two items is large, it indicates that they are similar in that dimension. If it's zero, it indicates that that feature is not particularly representative of at least one of the items. Embeddings are often normalized, and for normalized vectors the fact that magnitude drops out doesn't really matter. But it doesn't hurt either: both magnitudes will be one, so the denominator is also 1 and the cosine is just the dot product, i.e. the sum of the pair-wise products of the components.
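
The bag-of-words picture can be sketched in a few lines. The tokenizer here is just lowercase-and-split, and the sample documents are invented for the example:

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count word occurrences; each distinct word is one dimension."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Only words present in both documents contribute to the numerator.
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return num / den if den else 0.0

doc = "cosine similarity compares vector direction"
longer = " ".join([doc] * 3)             # same content, three times as long
partial = "vector direction matters"
other = "the cat sat on the mat"

sim_longer = cosine(bag_of_words(doc), bag_of_words(longer))
sim_partial = cosine(bag_of_words(doc), bag_of_words(partial))
sim_other = cosine(bag_of_words(doc), bag_of_words(other))
print(sim_longer, sim_partial, sim_other)
```

Tripling the document's length leaves its similarity to the original at 1, a document with no words in common scores 0, and partial overlap lands in between.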

d110af5ccf 690 days ago

> the reason the magnitude doesn't matter is that those counts will be much higher in longer documents ...

To be a bit more explicit about my intuition: the vector is encoding a ratio, isn't it? You want to treat 3:2, 6:4, 12:8, ... as equivalent in this case; normalization does exactly that.

marginalia_nu 691 days ago

Dunno if I have the full answer, but it seems in high dimensional spaces, you can typically throw away a lot of information and still preserve distance.

The J-L lemma is at least somewhat related, even though it doesn't to my understanding quite describe the same transformation.

https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
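
For a feel of what the J-L lemma is about, here is a rough sketch of a Gaussian random projection. Note that this is a different transformation from normalization (it is a random linear map to fewer dimensions). The dimensions, seeds and point count are arbitrary choices for the demo.

```python
import math
import random

def random_projection(points, target_dim, seed=0):
    """Project points with a random Gaussian matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [[sum(m * x for m, x in zip(row, p)) for row in matrix] for p in points]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(5)]
projected = random_projection(points, target_dim=256)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(5)
    for j in range(i + 1, 5)
]
print(min(ratios), max(ratios))
```

Even after dropping roughly three quarters of the dimensions, the pairwise distance ratios should stay close to 1.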

magicalhippo 690 days ago

I've been wondering the same.

When I dabbled with latent semantic indexing[1], using cosine similarity made sense as the dimensions of the input vectors were words, for example a 1 if a word was present or 0 if not. So one would expect vectors that point in a similar direction to be related.

I haven't studied LLM embedding layers in depth, so I've been wondering about using certain norms[2] instead to determine if two embeddings are similar. Does it depend on the embedding layer, for example?

Should be noted it's been many years since I learned linear algebra, so getting somewhat rusty.

[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis

[2]: https://en.m.wikipedia.org/wiki/Norm_(mathematics)
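
On the norm question, one relationship worth knowing: for unit-length vectors, the Euclidean (L2) distance and cosine similarity carry the same information, since ||a - b||^2 = 2 - 2·cos(a, b). A quick numerical check with arbitrary toy vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 1.0, 0.5])

squared_distance = euclidean(a, b) ** 2
from_cosine = 2.0 - 2.0 * cosine(a, b)
print(squared_distance, from_cosine)
```

So on normalized embeddings, ranking neighbours by smallest L2 distance and by largest cosine similarity gives the same order. The identity is specific to the L2 norm and to unit-length inputs.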
