by skeptrune 639 days ago
I tend to think "semantic" is a better term than "similarity" for describing dense vector search. Question-to-answer pairs are present in the training data, which by itself should discourage "similarity" thinking. Personally, it feels like there was a recent shift toward using "similarity", and I don't know why.

IDF-based keyword-matching techniques are a lot closer to "similarity" search than dense vector ones.
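
To make the "closer to similarity" point concrete, here is a minimal sketch of IDF-weighted token overlap. The smoothing formula, the sample documents, and the name `idf_overlap_scores` are illustrative assumptions, not the scoring of any particular engine.

```python
import math
from collections import Counter


def idf_overlap_scores(query, docs):
    """Score each doc by the summed IDF of the tokens it shares with the query."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many docs does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in toks)
    # Smoothed IDF (an assumed formula): rarer tokens get a higher weight.
    idf = {tok: math.log((n_docs + 1) / (df + 1)) + 1 for tok, df in doc_freq.items()}
    query_tokens = set(query.lower().split())
    return [sum(idf[tok] for tok in query_tokens & toks) for toks in tokenized]


docs = [
    "postgres vs mysql benchmark",
    "postgres release notes",
    "comparing relational databases",
]
scores = idf_overlap_scores("postgres vs mysql", docs)
```

The score is literally a measure of shared surface tokens, so the third document, which is about comparisons but shares no words with the query, scores zero.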

Dense vector search excels at queries that contain a semantic concept, such as a comparison like "A vs. B" [1]. Results shift toward comparisons in general rather than toward results containing the tokens A or B. A bag-of-words model, where all the tokens collapse into a single vector, is ideal for anything where you're trying to get at an idea more than a set of tokens.

> 1. solve this with query expansion, running multiple similarity searches..

It would be a lot easier to solve by just having the required/negated word feature. That's really my point in the blog. Having to experiment and do newfangled things with high latency penalties for this common query pattern is something to seriously consider when choosing pgvector.
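
For illustration, here is a rough sketch of what a required/negated word feature does, written as a post-filter over candidates that some dense retrieval step has already returned. The `+word` / `-word` query syntax and the names `parse_query` and `filter_results` are assumptions made up for this example. A real engine would push the filter into the index rather than post-filter like this.

```python
import re


def parse_query(query):
    """Split a query into required (+word), negated (-word), and free text."""
    required, negated, free = [], [], []
    for tok in query.lower().split():
        if tok.startswith("+") and len(tok) > 1:
            required.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            negated.append(tok[1:])
        else:
            free.append(tok)
    return required, negated, " ".join(free)


def filter_results(results, required, negated):
    """Keep results that contain every required word and no negated word."""
    kept = []
    for text in results:
        words = set(re.findall(r"\w+", text.lower()))
        if all(w in words for w in required) and not any(w in words for w in negated):
            kept.append(text)
    return kept


required, negated, free_text = parse_query("vector search +postgres -mysql")
candidates = [
    "Postgres vector search with pgvector",
    "MySQL and Postgres compared",
    "Qdrant vector search",
]
kept = filter_results(candidates, required, negated)
```

Only `free_text` would go to the embedding model. The required and negated words act as hard constraints, so no query expansion or extra searches are needed.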

> Explainability with highlights

It's rare that a semantic search does not match tokens to some extent. The example pictured in the blog is actually from a dense vector query. Dense vector search isn't that magical and retains some token-level precision. Also, you can use word-level embeddings for the highlights to make them more semantically accurate. However, in our system, we use Jaro-Winkler distance.

In a semantic search context, you do want something matching on meaning, but when it doesn't work well, the information from highlights helps you refactor and get it closer to ideal.
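
As a rough idea of how Jaro-Winkler can drive highlights, here is a minimal pure-Python sketch. It is not our production code: the `highlight` helper, the `<b>` markup, and the 0.85 threshold are assumptions picked for the example.

```python
import re


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def highlight(query, text, threshold=0.85):
    """Wrap words in <b> tags when they are close to any query word."""
    query_words = [w.lower() for w in re.findall(r"\w+", query)]

    def mark(match):
        word = match.group(0)
        best = max((jaro_winkler(word.lower(), q) for q in query_words), default=0.0)
        return f"<b>{word}</b>" if best >= threshold else word

    return re.sub(r"\w+", mark, text)


marked = highlight("vector search", "dense vectors beat banana")
```

Fuzzy matching like this catches near-misses such as "vectors" for "vector", which exact token highlighting would skip.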

> Sparse vectors, BM25, and other IDF modes

I'm a fan of pg_search (and tantivy), but SPLADE is a significant improvement and I think it's a big loss to not have easy access to it.

Appreciate that you enjoyed the post and thank you for starting a discussion on it!

[1]: https://hn.trieve.ai/about