by Xenoamorphous
282 days ago
Question for the experts: a few years back (even before COVID times?) I was tasked with building a news aggregator. Universal Sentence Encoder was new, and we didn’t even have BERT back then. It felt magical (at least to a regular software dev) to see how strongly the cosine similarity score correlated with how semantically similar two given pieces of text were. That plus some clustering algorithm got the job done. A few months ago I happened to play with OpenAI’s embeddings models (can’t remember which ones) and I was shocked to see that the cosine similarities of most texts were super close, even if the texts had nothing in common. It’s like the wide 0-1 range that USE (and later BERT) were giving me was compressed to a range perhaps 0.2 wide. Why is that? Does it mean those embeddings are not great for semantic similarity?
|
The absolute value of the cosine similarity isn't critical; what matters is the relative order when comparing multiple candidates. That said, if you fine-tune an embedding model for a specific domain, it will tend to give a wider range of cosine similarities, since it can learn which attributes specifically make texts similar or dissimilar.
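
To make the "order matters, not the absolute value" point concrete, here is a minimal sketch in plain Python. The vectors are made-up toy data (not output from any real embedding model): they share a large first coordinate so that every cosine similarity lands in a narrow band near 1, mimicking the compressed range described above. The `rank_candidates` and `rescale` helpers are hypothetical names introduced only for this illustration, and the min-max rescaling is just one way to spread scores within a candidate set for display; it does not change the ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar, plus raw scores."""
    scores = [cosine_similarity(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


def rescale(scores):
    """Min-max rescale scores to [0, 1] within this candidate set (display only)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Toy data (assumption): a large shared first coordinate compresses all scores.
query = [10.0, 1.0, 0.0]
candidates = [
    [10.0, 1.0, 0.1],   # nearly identical to the query
    [10.0, 0.0, 1.0],   # differs in the small coordinates
    [10.0, -1.0, 0.0],  # opposite sign in the second coordinate
]

order, scores = rank_candidates(query, candidates)
spread = rescale(scores)

for i in order:
    print(f"candidate {i}: raw={scores[i]:.4f} rescaled={spread[i]:.2f}")
```

All three raw scores sit above 0.98, yet the ordering still separates the near-duplicate from the other two, and the rescaled values make that separation easier to read.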