by acjohnson55 648 days ago
I might have missed this, but I think the post might bury the lede that in a high dimensional space, two randomly chosen vectors are very unlikely to have high cosine similarity. Or maybe another way to put it is that the expected value of the cosine of two random vectors approaches zero as the dimensionality increases.

Most similarity metrics will be very low if vectors don't even point in the same direction, so cosine similarity is a cheap way to filter out the vast majority of the data set.

It's been a while since I've studied this stuff, so I might be off target.

2 comments

Even if two random vectors don't have high cosine similarity (and I have not run into this issue in 3000 dimensions), cosine similarity is still usable in relative terms, i.e. relative to other items in the dataset. That keeps it useful.
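
To make the "relative terms" point concrete, here is a rough sketch on made-up data, not anything from the post. The dimension, dataset size, noise level, and the names `rank_by_cosine`, `make_demo`, and `planted` are all hypothetical. One item is a noisy copy of the query; its absolute cosine is modest, but it should still rank first against the unrelated items.

```python
import math
import random

def cosine(u, v):
    """Plain cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query, items):
    """Return (index, cosine) pairs sorted from most to least similar."""
    scored = [(i, cosine(query, item)) for i, item in enumerate(items)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def make_demo(dim=300, n_items=200, noise=2.0, planted=42, seed=0):
    """Hypothetical synthetic data: random Gaussian items plus one noisy copy of the query."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    items = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
    # Overwrite one slot with "query + noise" so there is a genuinely related item.
    items[planted] = [q + noise * rng.gauss(0.0, 1.0) for q in query]
    return query, items, planted

if __name__ == "__main__":
    query, items, planted = make_demo()
    ranking = rank_by_cosine(query, items)
    print("planted index:", planted)
    for index, score in ranking[:5]:
        print(f"item {index:3d}  cosine={score:+.3f}")
```

The absolute score of the related item is nowhere near 1, but the gap to everything else is what a nearest-neighbour lookup actually uses.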
Makes sense. I'm guessing it's one of those things where there's significant info in the magnitude of the exponent, in terms of relative similarity?
Nitpick: The expected value of the cosine is 0 even in low-dimensional spaces. It’s the expected square of that (i.e. the variance) which gets smaller with the dimension.
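
A quick way to check this numerically, as a sketch only: it assumes "random vector" means i.i.d. standard Gaussian components (other distributions behave differently), and the function names and sample counts are just illustrative. Under that assumption the mean of the cosine stays near 0 at every dimension, while the mean square tracks 1/dim.

```python
import math
import random

def random_cosine(dim, rng):
    """Cosine similarity of two independent Gaussian vectors in `dim` dimensions."""
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_stats(dim, samples=4000, seed=0):
    """Return (sample mean, sample mean of squares) of the cosine over `samples` random pairs."""
    rng = random.Random(seed)
    cosines = [random_cosine(dim, rng) for _ in range(samples)]
    mean = sum(cosines) / samples
    mean_sq = sum(c * c for c in cosines) / samples
    return mean, mean_sq

if __name__ == "__main__":
    for dim in (2, 10, 100):
        mean, mean_sq = cosine_stats(dim)
        print(f"dim={dim:4d}  mean={mean:+.4f}  mean_sq={mean_sq:.4f}  1/dim={1 / dim:.4f}")
```

So the typical magnitude of the cosine shrinks like 1/sqrt(dim), which is what makes a high cosine between unrelated vectors rare in high dimensions.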
That totally makes sense, thanks!