by godelski 1022 days ago
> The vectors are literally constructed so that cosine similarity is semantic similarity.

Are they? A learned embedding doesn't guarantee this, and a positional embedding certainly doesn't. Our latent embeddings don't either, unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There is no guarantee that the similarities they learn are the same things we would consider similarities. High-dimensional space is really weird.

And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower-dimensional projection (epic mapping) doesn't mean that the clustered points are similar in the higher-dimensional space. It all depends on the ability to unknot the data in the higher-dimensional space.

It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just tell you what constraints the model is working under. We want to understand the constraints and assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, and that gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.

2 comments

There are embeddings that are trained to reflect similarity, for example SentenceBERT, where the training process pushes pairs of similar sentences (as defined by whoever built the dataset) to have closer embeddings and dissimilar sentences to be further apart.
As the OP points out, cosine similarity doesn't always equate to relevance. As I was expanding upon, things get really messy as the dimensions increase, and your intuition about how vectors relate to one another goes out the window, fast. Distributional mass is not uniform. Random vectors become increasingly close to orthogonal. And of course, there are no guarantees that latent dimensions align with human-meaningful semantic features. There's no pressure to align basis vectors with human-perceived semantics. My argument isn't that there is no similarity pressure; it's that similarity in high dimensions means something different than similarity in low dimensions. For example, in high dimensions most of a unit cube's mass lies outside the unit sphere, while in 2 or 3 dimensions the unit cube (centered at the origin) sits entirely inside it with room to spare. High dimensions are weird, and that's what my comment is about, because many people are applying their low-dimensional intuition to ML.
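The cube-versus-sphere example is easy to check numerically. Below is a rough Monte Carlo sketch, not a proof. It assumes the cube has side length 1 and is centered at the origin, and that "unit sphere" means the ball of radius 1. The function name, sample count, and seed are illustrative choices.

```python
import random


def fraction_inside_unit_ball(dim, samples=20_000, seed=0):
    """Estimate the share of a centered unit cube that lies inside the unit ball."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        # Uniform point in the cube [-0.5, 0.5]^dim (side length 1, centered at the origin).
        sq_norm = sum((rng.random() - 0.5) ** 2 for _ in range(dim))
        # The point is inside the radius-1 ball when its squared norm is at most 1.
        if sq_norm <= 1.0:
            inside += 1
    return inside / samples


for dim in (2, 3, 12, 50, 100):
    print(dim, fraction_inside_unit_ball(dim))
```

In 2 and 3 dimensions the estimate is exactly 1.0, because the farthest corner of the cube is at distance sqrt(dim)/2 from the origin, which is below 1. The corners first reach the sphere at dim = 4, and since the expected squared norm of a uniform point in this cube is dim/12, the inside fraction falls toward zero once dim is well past 12.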
Do you know how embedding models are trained?
Yes. My comment is about the geometry of higher dimensions and what it means. It is not the same as in {2,3}D.
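One way to see the difference from {2,3}D is to look at what cosine similarity does for vectors that are unrelated by construction. The sketch below is a toy illustration only. It assumes independent Gaussian random vectors, which are not trained embeddings, and the function names and pair counts are made up for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, pairs=500, seed=0):
    """Average |cosine| over pairs of independent Gaussian random vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs


for dim in (2, 3, 100, 1000):
    print(dim, round(mean_abs_cosine(dim), 3))
```

In 2D the average absolute cosine of two random directions is 2/pi, roughly 0.64, so random pairs often look "similar". At dimension 1000 it shrinks to a few hundredths, so nearly every random pair is close to orthogonal. A given cosine value therefore carries very different information depending on the dimension.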