by minimaxir
289 days ago
It's likely because the definition of "similar" varies, and it doesn't necessarily mean semantic similarity. Depending on how the embedding model was trained, texts that merely share a format/syntax can indeed be "similar" on that axis. The absolute value of the cosine similarity isn't critical (only the ordering when comparing multiple candidates matters), but if you finetune an embedding model for a specific domain, it will produce a wider range of cosine similarities, since it can learn which specific attributes make texts similar or dissimilar.
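To make the "only the order matters" point concrete, here's a rough sketch in plain Python. The vectors are made-up toy values, not output from any real embedding model, and `rank_candidates` is just a hypothetical helper name for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" (assumed values for illustration only).
query = [1.0, 0.2, 0.1]
candidates = {
    "doc_a": [0.9, 0.3, 0.1],
    "doc_b": [0.7, 0.6, 0.3],
    "doc_c": [0.5, 0.5, 0.7],
}

ranking = rank_candidates(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

All three scores come out well above zero even though the candidates differ, so a fixed cutoff on the raw score would be hard to pick here. The ranking is the part you'd actually use when choosing among candidates.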