| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by minimaxir 289 days ago
	It's likely because the definition of "similar" varies, and it doesn't necessarily mean semantic similarity. Depending on how the embedding model was trained, just texts with a similar format/syntax are indeed "similar" on that axis. The absolute value of cosine similarity isn't critical (just the order when comparing multiple candidates), but if you finetune an embeddings model for a specific domain, the model will give a wider range of cosine similarity since it can learn which attributes specifically are similar/dissimilar.

1 comments

teepo 289 days ago

Thanks - that helped it click a bit more. If the relative ordering is correct it doesn't matter they look so compressed.

link

clickety_clack 289 days ago

That’s not necessarily true. If the embedding model hasn’t been trained on data you care about, then similarity might be dominated by features you don’t care about. Maybe you want documents that reference pizza toppings, but the embedding similarity might actually be dominated by the tone, word complexity, and the use of em dashes as compared to your prompt. That means the relative ordering might not turn out the way you want.

link