|
All approximate k-NN algorithms involve tradeoffs, with the main tradeoffs being usually fidelity (recall, vector reconstruction error, end-to-end application error with respect to some metric etc), speed (to query, in either serial or batch mode), and memory usage (vector compression, indexing overhead, etc). There is no one algorithm which is "better" than others along all dimensions; instead, for a given setting you look and see which technique (or combination of techniques) would let you lie along the Pareto-optimal tradeoff curve between fidelity/speed/memory, and for the application, you decide which of these is most important. LSH in industrial practice is only really useful for incredibly high-dimensional (think 100K+ dimensions) and/or massively sparse vectors (e.g., only 1K of those 100K dimensions have a non-zero value), or if latency is a very huge concern (e.g., online approximate self-attention of vectors). For dense vectors around 20-1000 dimensions, other techniques are better. The quality of results via LSH leaves much to be desired so I have not seen it used in most modern settings, but there is a huge literature on it as it is easier to prove things mathematically regarding its properties. Graph-based methods (e.g., HNSW): These work best for intermediate sized vector databases (100K - 10M vectors or so) as they are very fast and have high quality of results, but the memory overhead is high (indexing structures). It is impractical to use it for massive-scale indexes (think 1 billion+ vectors, unless you shard the database into smaller pieces and query all together, but this is computationally expensive. If the query vector distribution differs substantially from the database distribution you tend to run into problems here too. Cell-probe based methods (e.g., IVF): These are used for the largest databases (e.g., 1 billion - 1 trillion+ vectors; at least those where you are not simply sharding the database and still searching every shard) as they scale pretty well. The quality of results can leave a lot to be desired though, as the partitioning of the database into geometric cells and only searching some subset (n IVF cells, only searching ~sqrt(n) of those) will tend to have a hard upper limit on finding the true nearest neighbor (recall will saturate at around 30%, say). IVF is the foundation of Google ScaNN and is used heavily at FB/Meta as well. These choices are independent of how vectors are encoded in the database (e.g., using product quantization, scalar quantization, raw floating point vectors, binarized vectors, etc). There are some vector search problems (e.g., MIPS, or maximum inner product search, useful in recommendation problems or for things like approximate matrix multiplication/approximate self-attention) which still have no good solution (there is no metric space/geometry that makes sense here, and the query distribution usually varies quite widely from the database distribution). "Neural search" or other ML-assisted search techniques (given a query vector, what partition of the database should you look in?) are improving over time as well. This new paper of ours has a good overview of the tradeoffs involved
https://arxiv.org/pdf/2401.08281.pdf (I am the author of the GPU half of the Faiss library) |