Hacker News new | ask | show | jobs
by peter_l_downs 744 days ago
> JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop.

It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index.

OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results.

I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN.

Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me!

- DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520

- Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396

- JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168

- JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244

2 comments

It's not mentioned in the original paper, but DiskANN also supports PQ at build-time via `--build_PQ_bytes`, though it's a tradeoff with the graph quality as you mention.

One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.

That's correct!

I've tested the build-with-compression approach used here with all the datasets in JVector's Bench [1] and there's near zero loss in accuracy.

I suspect that the reason the DiskANN authors used the approach they did is that in 2019 Deep1B was about the only very large public dataset around, and since the vectors themselves are small your edge lists end up dominating your memory usage. So they came up with a clever solution, at the cost of making construction 2.5x as expensive. (Educated guess: 2x is from adding each vector to multiple partitions and the extra 50% to merge the results.)

So JVector is just keeping edge lists in memory today. When that becomes a bottleneck we may need to do something similar to DiskANN but I'm hoping we can do better because it's frankly a little inelegant.

[1] https://github.com/jbellis/jvector/blob/main/jvector-example...