by heisenburgzero 973 days ago
Why do embeddings have linear properties such that you can use functions like cosine similarity to compare them? It seems that after the signal has gone through so many non-linear activation layers, the linear properties should have broken down, or at least carry no guarantees.

I wasn't able to find a good answer online.

4 comments

Because neural networks use dot products, which are just un-normalized cosine similarities, as the main way to compare and transform embeddings in their hidden layers. Therefore, it makes sense that the most important signals in the data are arranged in latent space such that they are amenable to manipulations based on dot products.
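
To make the "un-normalized cosine similarity" remark concrete, here is a minimal sketch in plain Python. The vectors are made-up illustration values, not real embeddings, and the helper names are mine. It shows that the dot product equals the cosine times the two vector lengths, so scaling a vector changes the dot product but not the cosine.

```python
import math

def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Dot product divided by both lengths, i.e. the cosine of the angle.
    return dot(a, b) / (norm(a) * norm(b))

# Made-up illustration vectors (assumption, not real embeddings).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]
scaled_a = [10.0 * x for x in a]

# The dot product grows with the length of the vectors...
print(dot(a, b), dot(scaled_a, b))
# ...while the cosine similarity only depends on the angle between them.
print(cosine_similarity(a, b), cosine_similarity(scaled_a, b))
```
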
For what it's worth, I wonder the same thing and think it's not as obvious as others suggest. E.g. if you have an autoencoder for a one-hot encoding, you're essentially learning a pair of nonlinear maps that approximately invert each other, and that map some high-dimensional space to a low-dimensional one. You could imagine that it could instead learn something like a dense bit packing with a QAM Gray code[0]. In a one-hot encoding the dot product between any two distinct tokens is zero, however similar they are, so your transformations can't be learning to preserve it.

Somewhat naively, I might speculate that for e.g. sequence prediction, even if you had some efficient packing of space like that to try to maximally separate individual tokens, it's still advantageous to learn an encoding so that synonyms are clustered so that if there is an error, it doesn't cause mispredictions for the rest of the sequence.

I suppose then the point is that the structure exists in the latent space of language itself, and your coordinate maps pull it back to your naive encoding rather than preserving a structure that exists a priori on the naive encoding. i.e. you can't do dot products on the two spaces and expect them to be related. You need to map forward into latent space and do the dot product there, and that defines a (nonlinear) measure of similarity on your original space. Then the question is why latent space has geometry, and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy. So perhaps it is obvious after all!
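
A toy sketch of that last point, in plain Python. The three-word vocabulary and the 2-D embedding values are hand-picked assumptions for illustration, not the output of any trained model. Dot products on the one-hot side are always zero for distinct tokens, while similarity computed after mapping forward into the (hypothetical) latent space separates related from unrelated words.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

vocab = ["cat", "kitten", "carburetor"]

def one_hot(token):
    # Naive encoding: a 1 in the token's slot, 0 elsewhere.
    return [1.0 if t == token else 0.0 for t in vocab]

# Hypothetical learned embeddings (hand-picked values, assumption only).
embedding = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.3],
    "carburetor": [-0.2, 0.9],
}

def latent_similarity(token_a, token_b):
    # Map forward into latent space, then compare there.
    return cosine_similarity(embedding[token_a], embedding[token_b])

# No structure on the naive encoding: every distinct pair is orthogonal.
print(dot(one_hot("cat"), one_hot("kitten")))
print(dot(one_hot("cat"), one_hot("carburetor")))

# Structure appears only after the map into latent space.
print(latent_similarity("cat", "kitten"))
print(latent_similarity("cat", "carburetor"))
```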

[0] https://en.wikipedia.org/wiki/File:16QAM_Gray_Coded.svg

Thanks, that makes sense.

I think my comment was not worded properly. I was thinking "geometric properties = linear properties". What I really should have said is:

Why does the latent space have geometric properties such that we can use functions like cosine similarity to compare points in it?

So during training, the signal gets mapped to whatever part of latent space minimizes the error of the objective function as much as possible.

Many applications already use a cosine similarity function at the end of the network, so it is obvious why cosine similarity works for them. I reviewed other loss functions such as triplet loss. They use Euclidean distances, so I guess it makes sense why the geometric properties exist there too.
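
For reference, a minimal sketch of a triplet loss in plain Python, assuming the common hinge form with plain (not squared) Euclidean distance and a margin of 1.0. Some formulations use squared distances instead, and the vectors here are made-up values. Because the loss is written directly in terms of distances in latent space, minimizing it pushes the geometry into the embeddings.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Zero once the positive is closer to the anchor than the negative
    # by at least `margin`; otherwise the size of the violation.
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]

# Well-separated triplet: positive close, negative far away.
print(triplet_loss(anchor, [0.1, 0.0], [3.0, 0.0]))

# Violating triplet: the negative is closer than the positive.
print(triplet_loss(anchor, [2.0, 0.0], [0.5, 0.0]))
```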

For "and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy", what does "maximally information dense" mean? I still don't quite get it.

LLM vectors do have decent linear properties already. But for document embedding purposes they are often further trained for retrieval via cosine similarity, which enhances this. E.g. see Table 1 in [1]: average retrieval performance using BERT goes up from 54 to 76 after fine-tuning for embeddings.

[1] https://arxiv.org/pdf/1908.10084.pdf

Cosine similarity is not inherently better suited to "linear properties", whatever that means; it's just the cosine of the angle between two vectors. If the vectors are unit length, then it's just the projection of one onto the other.
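
A small check of the unit-length statement, in plain Python with made-up vectors and my own helper names. After normalizing both vectors, the plain dot product, the scalar projection of one onto the other, and the cosine similarity of the original vectors all agree up to floating-point error.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    # Rescale to unit length.
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def scalar_projection(a, b):
    # Length of the component of `a` along the direction of `b`.
    return dot(a, b) / norm(b)

# Made-up illustration vectors.
a = [3.0, 4.0]
b = [2.0, 1.0]
u, v = normalize(a), normalize(b)

print(cosine_similarity(a, b))
print(dot(u, v))
print(scalar_projection(u, v))
```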