|
The two main classes of approximate k-NN algorithms om high dimensional data would involve either (1) linear table scans (e.g., IVF / cell-probe indexes; brute-force exact indexing or the coarse quantizer for an IVF index) or (2) graph walking (HNSW, NN-Descent, CAGRA, etc). Some other approaches like LSH (depending on how it's represented as a data structure) can be in between the two. For (1), all of the indexing features of a typical RDBMS are overkill for IVF, where the only indexing you need is by IVF bucket, and then you simply sequentially scan and evaluate distances for everything in those buckets. However, if you want to perform filtering based on other criteria (e.g., filtering based on metadata associated with each vector), then multiple column indexes could be useful with a RDBMS, as you can more efficiently extract just the vectors that meet criteria prior to a sequential scan (e.g., give me all vectors which have a date in the range between X and Y and whose name contains the string "red"). For (2), these algorithms would be inefficient to express in a typical RDBMS. While you can easily encode a graph in a RDBMS, walking the graph sequentially based on criteria is another matter, because each time you'd have to hit the disk (or in-memory cache) to find the next row to look at, then repeat. Latency could be an issue unless it's all in memory, and if it's all in memory, a RDBMS-type representation is overkill versus more simple pointer chasing in memory. I guess it also depends upon latency / throughput needs (caching in memory on front of a database could help, but a pure in-memory vector index would be faster), persistence / fault tolerance / replication needs (in memory databases would need some kind of persistent store), or update needs (adding/removing vectors, though adding/removing vectors from an IVF or graph-based index becomes non-trivial if you add or remove too many vectors). Traditional databbases that have been around for a while might provide better guarantees on reliability. (I'm the author of the GPU half of the Faiss library) |
I don't see any fundamental reason why the index in Postgres would be slower than a specialized vector database. The query pattern of the vector database is simply a point query using an index, similar to other queries in an OLTP system.
The only limitation I see is scalability. It's not easy to make PostgreSQL distributed, but solutions like Citus exist, making it still possible.
(I'm the author of pgvecto.rs)