| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jhj 938 days ago

The two main classes of approximate k-NN algorithms om high dimensional data would involve either (1) linear table scans (e.g., IVF / cell-probe indexes; brute-force exact indexing or the coarse quantizer for an IVF index) or (2) graph walking (HNSW, NN-Descent, CAGRA, etc). Some other approaches like LSH (depending on how it's represented as a data structure) can be in between the two.

For (1), all of the indexing features of a typical RDBMS are overkill for IVF, where the only indexing you need is by IVF bucket, and then you simply sequentially scan and evaluate distances for everything in those buckets.

However, if you want to perform filtering based on other criteria (e.g., filtering based on metadata associated with each vector), then multiple column indexes could be useful with a RDBMS, as you can more efficiently extract just the vectors that meet criteria prior to a sequential scan (e.g., give me all vectors which have a date in the range between X and Y and whose name contains the string "red").

For (2), these algorithms would be inefficient to express in a typical RDBMS. While you can easily encode a graph in a RDBMS, walking the graph sequentially based on criteria is another matter, because each time you'd have to hit the disk (or in-memory cache) to find the next row to look at, then repeat. Latency could be an issue unless it's all in memory, and if it's all in memory, a RDBMS-type representation is overkill versus more simple pointer chasing in memory.

I guess it also depends upon latency / throughput needs (caching in memory on front of a database could help, but a pure in-memory vector index would be faster), persistence / fault tolerance / replication needs (in memory databases would need some kind of persistent store), or update needs (adding/removing vectors, though adding/removing vectors from an IVF or graph-based index becomes non-trivial if you add or remove too many vectors). Traditional databbases that have been around for a while might provide better guarantees on reliability.

(I'm the author of the GPU half of the Faiss library)

4 comments

VoVAllen 938 days ago

It's kind of a tradeoff. Performance is just one factor when choosing the vector database. In pgvecto.rs https://github.com/tensorchord/pgvecto.rs, we store the index separately from PostgreSQL's internal storage, unlike pgvector's approach. This enable us to get multi-threaded indexing, async indexing without blocking the insertion, and faster search speed comparing to pgvector.

I don't see any fundamental reason why the index in Postgres would be slower than a specialized vector database. The query pattern of the vector database is simply a point query using an index, similar to other queries in an OLTP system.

The only limitation I see is scalability. It's not easy to make PostgreSQL distributed, but solutions like Citus exist, making it still possible.

(I'm the author of pgvecto.rs)

link

anonymousDan 938 days ago

Would a graph database (e.g. Neo4j) be any better?

link

toasted-subs 938 days ago

I cannot think of a situation where neo4j does anything better than traditional relational databases

link

inkyoto 938 days ago

Graph databases, neo4j including, handle many-to-many relationships better than any RDBMS.

link

simonw 938 days ago

In what way do they handle them better?

link

activescott 938 days ago

Very useful for me to read this (and look up some of the mentioned concepts). Thank you!

link

simonw 938 days ago

Thanks, that was really interesting!

link