by simonw 702 days ago

The biggest one is that it's hard to get "zero matches" from an embeddings database. You get back all results ordered by distance from the user's query, but it will really scrape the bottom of the barrel if there aren't any great matches - which can lead to bugs like this one: https://simonwillison.net/2024/Jun/6/accidental-prompt-injec...
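
A minimal sketch of why "zero matches" is hard to get, assuming plain cosine similarity over toy 3-dimensional vectors (the `cosine_similarity` and `top_k` helpers and the sample vectors are made up for illustration, not taken from any particular embeddings database): a top-k query returns k results even when nothing in the corpus is close to the query.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Rank every document by similarity and return the best k.

    There is no notion of "no match" here: the list is only ever
    shorter than k if the corpus itself has fewer than k documents.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus: all three documents point in roughly the same direction.
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.8, 0.2, 0.1],
}

# A query that is nearly orthogonal to everything in the corpus.
unrelated_query = [0.0, 0.0, 1.0]
results = top_k(unrelated_query, docs, k=3)

for doc_id, score in results:
    print(doc_id, round(score, 3))
```

All three documents come back, even though every score is close to zero.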

The other problem is that embeddings search can miss things that a direct keyword match would have caught. If you have key terms that are specific to your corpus - product names for example - there's a risk that a vector match might not score those as highly as BM25 would have so you may miss the most relevant documents.

Finally, embeddings are much more black box and hard to debug and reason about. We have decades of experience tweaking and debugging and improving BM25-style FTS search - the whole field of "Information Retrieval". Throwing that all away in favour of weird new embedding vectors is suboptimal.

kgeist 701 days ago

>but because embeddings search orders by similarity score it will ALWAYS return results, really scraping the bottom of the barrel if it has to

Why not have a similarity threshold? Say, if the similarity is below 0.7, do not accept the search result.
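
A fixed cutoff of this kind might look like the sketch below. It assumes the scores are similarities where higher means closer (for a distance metric the comparison would flip), and `filter_by_similarity` and the sample scores are hypothetical.

```python
def filter_by_similarity(results, min_similarity=0.7):
    """Keep only (doc_id, score) pairs whose score meets the cutoff.

    Assumes higher score means a closer match. Returns an empty list
    when nothing clears the threshold, which is the "zero matches"
    case a plain top-k query never produces.
    """
    return [(doc_id, score) for doc_id, score in results
            if score >= min_similarity]

# Made-up scores for illustration.
results = [("doc1", 0.82), ("doc2", 0.71), ("doc3", 0.41)]
kept = filter_by_similarity(results)

# A query with no good matches now yields nothing at all.
empty = filter_by_similarity([("doc4", 0.35)])

print(kept)
print(empty)
```

The filter itself is trivial; the open question is whether a single `min_similarity` value suits every query.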

simonw 701 days ago

It turns out picking that threshold is extremely difficult - I've tried! The value seems to differ for different searches, so picking eg 0.7 as a fixed value isn't actually as useful as you would expect.

zmccormick7 701 days ago

Agreed that thresholds don't work when applied to the cosine similarity of embeddings. But I have found that the similarity scores returned by high-quality rerankers, especially Cohere, are consistent and meaningful enough that using a threshold works well there.

kgeist 697 days ago

I use a similarity threshold (to remove absolutely irrelevant results) and then use a reranker to get the Top N.
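
Roughly, that two-stage flow could be sketched as follows. The reranker here is a toy word-overlap function standing in for a real reranking model, and `retrieve`, `toy_rerank_score`, the 0.3 cutoff and the sample scores are all assumptions made for illustration.

```python
def toy_rerank_score(query, text):
    """Hypothetical stand-in for a reranker: fraction of query words in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)

def retrieve(query, candidates, rerank_score, min_similarity=0.3, top_n=2):
    """Drop clearly irrelevant candidates, then rerank what is left.

    candidates: (text, embedding_similarity) pairs from a vector search.
    rerank_score: callable (query, text) -> float, higher is better.
    """
    # Stage 1: a loose threshold that only removes obvious junk.
    survivors = [text for text, sim in candidates if sim >= min_similarity]
    # Stage 2: order the survivors by reranker score and keep the top N.
    reranked = sorted(survivors,
                      key=lambda text: rerank_score(query, text),
                      reverse=True)
    return reranked[:top_n]

# Made-up embedding similarities for illustration.
candidates = [
    ("how to reset your password", 0.62),
    ("password policy overview", 0.58),
    ("reset the router", 0.55),
    ("company picnic photos", 0.12),
]

top = retrieve("reset password", candidates, toy_rerank_score)
print(top)
```

Because the threshold only has to exclude obvious junk, it can be set loosely, and the final ordering is left to the reranker.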

jairuhme 701 days ago

To add to what the other commenter noted: sometimes the difference between results gets very granular (e.g. .65789 vs .65788), so deciding where that threshold should be is a little trickier.
