HN Mirror


by acjohnson55 648 days ago

I might have missed this, but I think the post might bury the lede that in a high dimensional space, two randomly chosen vectors are very unlikely to have high cosine similarity. Or maybe another way to put it is that the expected value of the cosine of two random vectors approaches zero as the dimensionality increases.

Most similarity metrics will be very low if vectors don't even point in the same direction, so cosine similarity is a cheap way to filter out the vast majority of the data set.

It's been a while since I've studied this stuff, so I might be off target.

2 comments

OutOfHere 648 days ago

Even if two random vectors don't have high cosine similarity, and I have not had this issue in 3000 dimensions, the cosine similarity is still usable in relative terms, i.e. relative to other items in the dataset. This keeps it useful.
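To illustrate the "relative terms" point, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the 3000-dimensional Gaussian vectors, the 50-item dataset, the `rank_by_cosine` helper, and the choice to make item 7 weakly related to the query by mixing in 20% of it. The related item's absolute cosine stays low (around 0.2), yet it should still stand clearly apart from the unrelated items, whose scores cluster near zero.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, items):
    """Return (indices sorted from most to least similar, raw scores)."""
    scores = [cosine(query, item) for item in items]
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical data: random Gaussian vectors stand in for real embeddings.
rng = random.Random(0)
dim = 3000
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
items = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(50)]

# Item 7 is only weakly related: a small share of the query plus noise.
items[7] = [0.2 * q + rng.gauss(0.0, 1.0) for q in query]

order, scores = rank_by_cosine(query, items)

if __name__ == "__main__":
    print("top item:", order[0], "score:", round(scores[order[0]], 3))
    print("runner-up:", order[1], "score:", round(scores[order[1]], 3))
```

The ordering, not the absolute score, is what carries the signal here. Real embeddings are not isotropic Gaussians, so the actual gap between related and unrelated items will differ.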


acjohnson55 647 days ago

Makes sense. I'm guessing it's one of those things where there's significant info in the magnitude of the exponent, in terms of relative similarity?


DavidSJ 648 days ago

Nitpick: The expected value of the cosine is 0 even in low-dimensional spaces. It’s the expected square of that (i.e. the variance) which gets smaller with the dimension.
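This is easy to check numerically. The sketch below is a rough Monte Carlo estimate, not a proof. It assumes the "random vectors" have i.i.d. standard normal entries (so their directions are uniform on the sphere), and the function names, trial count, and seed are arbitrary choices. Under that assumption the mean cosine should sit near 0 at every dimension, while the mean squared cosine should track 1/dim.

```python
import math
import random


def random_cosine(dim, rng):
    """Cosine similarity of two vectors with i.i.d. standard normal entries."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_stats(dim, trials=1000, seed=0):
    """Estimate E[cos] and E[cos^2] for random vector pairs in `dim` dimensions."""
    rng = random.Random(seed)
    samples = [random_cosine(dim, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    mean_square = sum(s * s for s in samples) / trials
    return mean, mean_square


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        mean, mean_square = cosine_stats(dim)
        print(f"dim={dim:5d}  mean={mean:+.4f}  "
              f"E[cos^2]={mean_square:.4f}  1/dim={1 / dim:.4f}")
```

Since the mean is zero, the mean square is the variance, so the typical magnitude of the cosine shrinks like 1/sqrt(dim).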


acjohnson55 648 days ago

That totally makes sense, thanks!
