Hacker News new | ask | show | jobs
by jhj 892 days ago
I don't know why PQ is listed as an "indexing strategy". It's a vector compression/quantization technique, not a means of partitioning the search space. You could encode vectors with PQ when using brute-force/flat index, an IVF index, with HNSW (all of which are present in Faiss with PQ encoding as IndexPQ, IndexIVFPQ and IndexHNSWPQ respectively), or even k-D trees or ANNOY if someone wanted to do that.

"Use HNSW or Annoy for very large datasets where query speed is more important than precision": Graph-based methods have huge memory overhead and construction cost, and they aren't practical for billion-scale datasets. Also, they will usually be more accurate and faster than IVF techniques (as you would need to visit a large number of IVF cells to get comparable accuracy), though IVF can scale to trillion-sized databases without much overhead yet with reasonable speed/accuracy tradeoffs unlike other techniques. I'd say "use for medium-scale datasets where query speed is important, yet high accuracy is still desired and flat/brute-force indexing is impractical".

1 comments

It turns a continuous space into a discrete space. You would first do PQ, then do KNN of the new discrete vector. This way you can compress the vocabulary to a fixed size.
The implementations that I am aware of, including the one in Faiss which I wrote (described in detail in https://arxiv.org/abs/1702.08734), do not index the vector based on its PQ encoding (e.g., in IVFPQ). The IVF cell chosen in which to put the vector is based on its pre-quantized (full precision floating point) representation. It would lose too much precision to perform all comparisons in the compressed space.

Also, distance comparisons are usually of the "ADC" (asymmetric distance comparison) form: the query vector is in full floating point format, and vectors, if quantized/compressed in the database, are effectively decompressed and compared in the full floating point domain. This is true even with PQ, as the distance between the query vector subspaces with each of the 2^n PQ codes for the same subspace are precomputed before comparison (and then the distance computation becomes a lookup-add based on the precomputed distance tables).

LSH techniques unlike PQ are more accurately described as an indexing technique, since the buckets into which vectors are placed are based on their encoding (via hashing/binarization) and are fully compared in that space.