Hacker News new | ask | show | jobs
by ntonozzi 618 days ago
One important factor this article neglects to mention is that modern text embedding models are trained to maximize distance of dissimilar texts under a specific metric. This means that the embedding vector is not just latent weights plucked from the last layer of a model, but instead specifically trained to be used with a particular distance function, which is the cosine distance for all the models I'm familiar with.

You can learn more about how modern embedding models are trained from papers like Towards General Text Embeddings with Multi-stage Contrastive Learning (https://arxiv.org/abs/2308.03281).

1 comments

Yes, exactly. It’s one of the reasons that an autoencoder’s compressed representation might not work that well for similarity. You need to explicitly push similar examples together and dissimilar examples apart, otherwise everything can get smashed close together.

The next level of understanding is asking how “similar” and “dissimilar” are chosen. As an example, should texts about the same topic be considered similar? Or maybe texts from the same user (regardless of what topic they’re talking about)?