| > JVector, the library that powers DataStax Astra vector search, now supports indexing larger-than-memory datasets by performing construction-related searches with compressed vectors. This means that the edge lists need to fit in memory, but the uncompressed vectors do not, which gives us enough headroom to index Wikipedia-en on a laptop. It's interesting to note that JVector accomplishes this differently than how DiskANN described doing it. My understanding (based on the links below, but I didn't read the full diff in #244) is that JVector will incrementally compress the vectors it is using to construct the index; whereas DiskANN described partitioning the vectors into subsets small enough that indexes can be built in-memory using uncompressed vectors, building those indexes independently, and then merging the results into one larger index. OP, have you done any quality comparisons between an index built with JVector using the PQ approach (small RAM machine) vs. an index built with JVector using the raw vectors during construction (big RAM machine)? I'd be curious to understand what this technique's impact is on the final search results. I'd also be interested to know if any other vector stores support building indexes in limited memory using the partition-then-merge approach described by DiskANN. Finally, it's been a while since I looked at this stuff, so if I mis-wrote or mis-understood please correct me! - DiskANN: https://dl.acm.org/doi/10.5555/3454287.3455520 - Anisotropic Vector Quantization (PQ Compression): https://arxiv.org/abs/1908.10396 - JVector/#168: How to support building larger-than-memory indexes https://github.com/jbellis/jvector/issues/168 - JVector/#244: Build indexes using compressed vectors https://github.com/jbellis/jvector/pull/244 |
One interesting property in benchmarking is that the distance comparison implementations for full-dim vectors can often be more efficient than those for PQ-compressed vectors (straight-line SIMD execution vs table lookups), so on some systems cluster-and-merge is relatively competitive in terms of build performance.