by panarky 994 days ago
I'd love to know how vector databases compare in their ability to do hybrid queries, i.e. vector similarity filtered by metadata values. For example, find the 100 items with the closest cosine similarity where genre = jazz and publication date is between 1990 and 2000.

Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?

It seems like measuring precision and recall for hybrid queries would be illuminating.
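
To make the question concrete, here is a rough Python sketch of the two strategies and a recall measurement. Everything in it is an assumption for illustration: the data is random, the helper names (`pre_filter_search`, `post_filter_search`, `recall`) are made up, and an exact brute-force scan stands in for a real vector index, so it only shows the shape of the problem, not how any particular database behaves.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def make_items(n, dim, seed=0):
    """Random toy records with a vector plus genre/year metadata (hypothetical data)."""
    rng = random.Random(seed)
    genres = ["jazz", "rock", "pop", "classical"]
    return [
        {
            "id": i,
            "vec": [rng.gauss(0, 1) for _ in range(dim)],
            "genre": rng.choice(genres),
            "year": rng.randint(1950, 2020),
        }
        for i in range(n)
    ]

def matches(item):
    """The metadata filter from the example: jazz, published 1990 to 2000."""
    return item["genre"] == "jazz" and 1990 <= item["year"] <= 2000

def pre_filter_search(items, query, k):
    """Filter first, then rank only the matching subset (exact answer here)."""
    candidates = [it for it in items if matches(it)]
    candidates.sort(key=lambda it: cosine(it["vec"], query), reverse=True)
    return [it["id"] for it in candidates[:k]]

def post_filter_search(items, query, k, overfetch):
    """Rank everything, keep the top `overfetch`, then apply the filter."""
    ranked = sorted(items, key=lambda it: cosine(it["vec"], query), reverse=True)
    return [it["id"] for it in ranked[:overfetch] if matches(it)][:k]

def recall(found, truth):
    """Fraction of the true top-k that the search actually returned."""
    if not truth:
        return 1.0
    return len(set(found) & set(truth)) / len(truth)

if __name__ == "__main__":
    items = make_items(2000, 8)
    query = [1.0] * 8
    truth = pre_filter_search(items, query, k=10)
    for overfetch in (10, 100, 1000):
        got = post_filter_search(items, query, k=10, overfetch=overfetch)
        print(overfetch, len(got), recall(got, truth))
```

With a selective filter, the post-filter variant can return fewer than k results (possibly none) unless `overfetch` is large, which is exactly the failure mode asked about above. A real approximate index adds its own recall loss on top of this.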

5 comments

There is an on-stage filtering approach that uses an extended HNSW: https://qdrant.tech/articles/filtrable-hnsw/
I can't speak to the others, but pgvector indices can "break" hybrid queries. For example, say you select with a where clause on metadata (where genre = jazz) and order by distance from a vector (the embedding of a sound clip). If the index doesn't have many (or any) vectors in the neighborhood of the query vector that also match the metadata, the query can return no results. I discuss this in a blog post here [1].

[1]: https://www.polyscale.ai/blog/pgvector-bigger-boat/

You can totally do this in Cassandra. See https://docs.datastax.com/en/astra-serverless/docs/vector-se...
What you’re describing is easily done in Pinecone, and in other solutions as well. See: https://docs.pinecone.io/docs/metadata-filtering
> do hybrid queries

"no" - the graph objects after training are opaque AFAIK

Actually a lot of the databases offer filtering before or after similarity search.
I'd say it's table stakes today.