| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by neoncontrails 831 days ago
	> In the following, we show that [taking cosine similarity between two features in a learned embedding] can lead to arbitrary results, and they may not even be unique. Was uniqueness ever a guarantee? It's a distance metric. It's reasonable to assume that two features can be equidistant to the ideal solution to a linear system of equations. Maybe I'm missing something.

3 comments

nerdponx 830 days ago

It's not even a distance metric, it doesn't obey the triangle inequality (hence the not-technically-meaningful name "similarity", like "collection" as opposed to "set").

link

__tmk__ 830 days ago

1-cos_sim(x,y) is a valid distance metric for L2 normalized vectors, though.

link

epgui 830 days ago

It is a distance metric on a projection.

link

Der_Einzige 830 days ago

Not obeying the triangle inequality simply means that it’s fast to compute.

link

queuebert 830 days ago

One does not normally consider trigonometric functions fast to compute. Would you mind elaborating?

link

seanhunter 830 days ago

Consider the geometric definition of the dot product of two vectors,

a.b = |a||b|cos theta.

This means you get cos of the angle between the two vectors by just dividing the dot product by the product of their magnitudes. You don't actually take cos of the angle to get cosine similarity (for one because you don't know the angle) you just use "cos theta" (calculated as above) as a proxy for how narrow the angle is and therefore how close the two embeddings are.

The paper in TFA shows that if you construct an embedding space such that that angle isn't meaningfully measuring similarity then a low angle doesn't mean two things are very similar. I have a similar paper measuring bears and woods but I haven't got around to typesetting it for publication yet.

link

SR2Z 830 days ago

You don't need to call cosine to compute cosine similarity - most of the time, a normalized dot product is sufficient.

Needless to say, dot products are directly supported in hardware via the FMA unit.

link

queuebert 830 days ago

Yes, this would be fast assuming the vectors are pre-normalized, obviating the need to repeatedly calculate two square roots.

link

SR2Z 830 days ago

Even if the vectors are NOT normalized, you don't always need to normalize them.

Add in nd indices and the costs tend to be very small.

link

pletnes 830 days ago

I sure hope noone claimed that. You’re doing potentially huge dimensionality reduction, uniqueness would be like saying you cannot have md5 collisions.

link

JohnKemeny 830 days ago

Johnson and Lindenstrauss disagree. https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...

link

yorwba 830 days ago

If you have 1000 points and want to preserve their squared distances to within an error of 1%, the Johnson-Lindenstrauss construction suggests an embedding dimension of 8(ln 1000)/(0.01²) > 552620. If your points start out in a lower-dimensional space than that to begin with, it's obviously pointless.

The crossover point where the number of dimensions falls below the number of points is at 1113868. If you're willing to tolerate 10% error, it's at 7094.

link

SimplyUnknown 830 days ago

I think maybe it's poorly phrased. As far as I can tell, their linear regression example for eq. 2 has an unique solution, but I think they state I that when optimizing for cosine similarity you can find non-unique solutions. But I haven't read in detail.

Then again, you could argue whether that is a problem when considering very high dimensional embeddings. Their conclusions seem to point in that direction but I would not agree on that.

link