Hacker News new | ask | show | jobs
by neoncontrails 831 days ago
> In the following, we show that [taking cosine similarity between two features in a learned embedding] can lead to arbitrary results, and they may not even be unique.

Was uniqueness ever a guarantee? It's a distance metric. It's reasonable to assume that two features can be equidistant to the ideal solution to a linear system of equations. Maybe I'm missing something.

3 comments

It's not even a distance metric, it doesn't obey the triangle inequality (hence the not-technically-meaningful name "similarity", like "collection" as opposed to "set").
1-cos_sim(x,y) is a valid distance metric for L2 normalized vectors, though.
It is a distance metric on a projection.
Not obeying the triangle inequality simply means that it’s fast to compute.
One does not normally consider trigonometric functions fast to compute. Would you mind elaborating?
Consider the geometric definition of the dot product of two vectors,

a.b = |a||b|cos theta.

This means you get cos of the angle between the two vectors by just dividing the dot product by the product of their magnitudes. You don't actually take cos of the angle to get cosine similarity (for one because you don't know the angle) you just use "cos theta" (calculated as above) as a proxy for how narrow the angle is and therefore how close the two embeddings are.

The paper in TFA shows that if you construct an embedding space such that that angle isn't meaningfully measuring similarity then a low angle doesn't mean two things are very similar. I have a similar paper measuring bears and woods but I haven't got around to typesetting it for publication yet.

You don't need to call cosine to compute cosine similarity - most of the time, a normalized dot product is sufficient.

Needless to say, dot products are directly supported in hardware via the FMA unit.

Yes, this would be fast assuming the vectors are pre-normalized, obviating the need to repeatedly calculate two square roots.
Even if the vectors are NOT normalized, you don't always need to normalize them.

Add in nd indices and the costs tend to be very small.

I sure hope noone claimed that. You’re doing potentially huge dimensionality reduction, uniqueness would be like saying you cannot have md5 collisions.
If you have 1000 points and want to preserve their squared distances to within an error of 1%, the Johnson-Lindenstrauss construction suggests an embedding dimension of 8(ln 1000)/(0.01²) > 552620. If your points start out in a lower-dimensional space than that to begin with, it's obviously pointless.

The crossover point where the number of dimensions falls below the number of points is at 1113868. If you're willing to tolerate 10% error, it's at 7094.

I think maybe it's poorly phrased. As far as I can tell, their linear regression example for eq. 2 has an unique solution, but I think they state I that when optimizing for cosine similarity you can find non-unique solutions. But I haven't read in detail.

Then again, you could argue whether that is a problem when considering very high dimensional embeddings. Their conclusions seem to point in that direction but I would not agree on that.