by seanhunter 618 days ago
It pains me that the author (and many others) refer to cosine similarity as a distance metric. The cosine of the angle between two vectors is a quick measure of how closely their directions agree, but it is not a distance metric (which would measure the distance between their endpoints).

Thought experiment: If I go outside and point at the moon, the cosine of the angle between the position vector of my finger and the position vector of the moon relative to me is 1.[1] However my finger is very much not on the moon. The distance between the two vectors is very large even though their angle is zero.

That's why it's cosine similarity not cosine distance. If your embedding methodology is trained such that the angle between vectors is a good enough proxy for distance then it will be[2]. But it's not a distance measure.
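
To make the moon example concrete, here is a minimal sketch in plain Python. The coordinates are made up for illustration (a fingertip 1 m away and a moon roughly 384,400 km away, both straight along the same axis), and the helper names are mine, not from the article.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (ignores their lengths)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical position vectors in metres, relative to the observer.
# Both point in exactly the same direction; only the lengths differ.
finger = (0.0, 0.0, 1.0)
moon = (0.0, 0.0, 384_400_000.0)

cos_sim = cosine_similarity(finger, moon)  # direction agreement
dist = math.dist(finger, moon)             # Euclidean distance between endpoints

print(f"cosine similarity: {cos_sim}")
print(f"euclidean distance: {dist:.0f} m")
```

The similarity comes out as 1 while the endpoints are hundreds of thousands of kilometres apart, which is the whole point: the cosine only sees direction.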

[1] because the angle is 0 and cos 0 = 1.

[2] A self-fulfilling prophecy, but one that the people making the embedding actually have the power to make true, presumably because training disperses the embeddings so that their magnitudes are roughly equal. You end up with a kind of high-dimensional sphere of embeddings, with most of the vectors ending near the surface, not too many points far in the interior, and not too many spiking way out the sides. It seems OpenAI also normalizes all the vectors to unit length, so the magnitude doesn't matter. But it's still not a distance measure.

4 comments

On point.

Just because the L2-norm yields the same rankings as cosine similarity for the particular case of normalized embeddings when retrieving relevant documents doesn't mean that any other L-norm or commonly used measure in the field of (un)supervised learning or information retrieval presents itself as a viable alternative for the problem at hand — which, by the way, had to be guessed too.
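
For anyone who wants to check the "same rankings" claim, here is a small sketch. It relies on the identity ||u - v||^2 = 2 - 2 cos(theta) for unit vectors, so sorting by descending cosine and by ascending L2 distance must agree. The toy vectors and document names are invented for the example.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up toy "embeddings", normalized to unit length.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([0.9, 2.1, 0.4]),
    "b": normalize([-1.0, 0.5, 2.0]),
    "c": normalize([2.0, 1.0, 1.0]),
    "d": normalize([0.1, -3.0, 0.2]),
}

# Highest cosine similarity first vs. smallest Euclidean distance first.
by_cosine = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: math.dist(query, docs[k]))

print(by_cosine)
print(by_l2)
```

Both lists come out in the same order here. That only holds because everything was normalized first; drop the `normalize` calls and the two rankings are free to disagree.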

Looking at the "history" of this development (e.g. bag-of-words model, curse of dimensionality etc.) provides a solid explanation for why we've ended up using embeddings and cosine similarity for retrieval.

That said, I'm curious to see any advancements and new approaches in this area. This might sound snarky, but I still commend the author for doing what I haven't managed to do so far: writing down their view of the world and putting it out for public scrutiny.

BTW the author mentions Mahalanobis distance[1]. This is a good one to know about but it isn't useful in this application. As I understand it (having used it a bit and even implemented the algorithm a couple of times) Mahalanobis distance multiplies by the inverse of the covariance matrix. What that does is essentially undo covariance if the dimensions of your space are not equivalent. So if moving in one dimension is highly correlated with moving in another dimension, Mahalanobis distance corrects for that effect.

[1] https://en.wikipedia.org/wiki/Mahalanobis_distance

Another formulation of this which I like is that the Mahalanobis distance measures “how many standard deviations apart” two state vectors are.

Note that since the covariance matrix has the variances on its diagonal, it corrects not only for correlations but also normalizes each dimension by its standard deviation.
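
A rough sketch of both effects, restricted to two dimensions so the matrix inverse can be written out by hand. The function name and the covariance values are assumptions for illustration; a real implementation would use a linear algebra library and an estimated covariance matrix.

```python
import math

def mahalanobis_2d(x, y, cov):
    """sqrt((x - y)^T cov^-1 (x - y)) for 2-D points and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - y[0], x[1] - y[1])
    # tmp = cov^-1 @ dx
    tmp = (inv[0][0] * dx[0] + inv[0][1] * dx[1],
           inv[1][0] * dx[0] + inv[1][1] * dx[1])
    return math.sqrt(dx[0] * tmp[0] + dx[1] * tmp[1])

origin = (0.0, 0.0)

# Uncorrelated dimensions with standard deviations 2 and 10:
# one standard deviation along either axis is the same distance.
cov_diag = ((4.0, 0.0), (0.0, 100.0))
one_sd_x = mahalanobis_2d(origin, (2.0, 0.0), cov_diag)
one_sd_y = mahalanobis_2d(origin, (0.0, 10.0), cov_diag)

# Strongly correlated dimensions: moving with the correlation is "cheap",
# moving against it is "expensive", even though the Euclidean step is equal.
cov_corr = ((1.0, 0.9), (0.9, 1.0))
along = mahalanobis_2d(origin, (1.0, 1.0), cov_corr)
against = mahalanobis_2d(origin, (1.0, -1.0), cov_corr)

print(one_sd_x, one_sd_y)
print(along, against)
```

The first pair illustrates the "how many standard deviations apart" reading, the second the correction for correlation.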

You can recover a distance metric from the cosine similarity of unit (i.e. normalized) vectors by taking their Euclidean distance, which can be written as sqrt(2 * (1 - cosine similarity)), i.e. basically a square root of the complement of cosine similarity. Or you can just take the complement and forget the square root, which isn't technically a distance metric (it can violate the triangle inequality) but might be good enough. Or you can take the arccosine of the cosine similarity to get angular distance, which is a true distance metric but might be too expensive.
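
Here is a quick sketch of the three options for unit vectors, with a check of the triangle inequality on three points at 0, 45 and 90 degrees on the unit circle. The function names are my own, and the clamping is just a guard against floating-point rounding.

```python
import math

def cos_sim_unit(u, v):
    """Cosine similarity of two vectors already normalized to unit length."""
    return sum(a * b for a, b in zip(u, v))

def chord_distance(u, v):
    """Euclidean distance between unit vectors: sqrt(2 * (1 - cos))."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos_sim_unit(u, v))))

def cosine_complement(u, v):
    """1 - cos: cheap, but not a true metric."""
    return 1.0 - cos_sim_unit(u, v)

def angular_distance(u, v):
    """Angle in radians (arc length on the unit sphere)."""
    c = max(-1.0, min(1.0, cos_sim_unit(u, v)))  # clamp rounding error
    return math.acos(c)

def unit(deg):
    """Unit vector in the plane at the given angle in degrees."""
    t = math.radians(deg)
    return (math.cos(t), math.sin(t))

a, b, c = unit(0), unit(45), unit(90)

for name, d in [("chord", chord_distance),
                ("1 - cos", cosine_complement),
                ("angle", angular_distance)]:
    direct = d(a, c)
    via_b = d(a, b) + d(b, c)
    print(f"{name:8s} direct={direct:.4f}  via b={via_b:.4f}")
```

For "1 - cos" the direct route a to c is longer than the detour through b, which is exactly the triangle inequality failing. Chord and angle both behave.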

Both chord length and arc length can be good ones. For some purposes versine (1 - cosine, sometimes called the "normal distance") and the half-tangent of arc length (i.e. the distance to one point when stereographically projected so the center of projection is the other point) are also useful types of quasi-distance. https://davidhestenes.net/geocalc/pdf/CompGeom-ch3.pdf

Isn't that normalization in high dimensions pretty much the whole point?

You don't care about differing radius at all, and the angle between the unit radial vectors directly correlates to the "great circle distance" between the points on the surface, right?

Correct, but that doesn't make cosine similarity a distance metric. It just means that in this particular case arc length is your distance metric and you're using trigonometry to get it.