
IVF indexing can be split into two phases: computing the centroids (KMeans), and assigning each point to its nearest centroid to build the inverted lists. The KMeans stage is the most time-consuming part, and it can be greatly accelerated with a GPU: 1M 960-dim vectors can be clustered in less than 10s. We did the KMeans phase externally, and the assignment phase inside Postgres. KMeans depends only on the data distribution, not on any specific data points, so we can run it on a sample of the data, and inserting or deleting data won't affect the result significantly. An update is just a matter of assigning the new vector to its nearest cluster and appending it to the corresponding list, which is very light compared to inserting into HNSW.
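To make the two phases concrete, here is a rough NumPy sketch, not the actual implementation described above. It runs on CPU with plain Lloyd's KMeans, and every name in it (`kmeans`, `nearest_centroid`, `build_ivf`, `insert`) is made up for illustration. It trains centroids on a sample, builds the inverted lists by nearest-centroid assignment, and then inserts a new vector with a single distance computation against the centroids.

```python
import numpy as np

def nearest_centroid(vectors, centroids):
    """Index of the closest centroid (squared L2) for each row of `vectors`."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed for all pairs at once
    d = ((vectors ** 2).sum(axis=1, keepdims=True)
         - 2.0 * vectors @ centroids.T
         + (centroids ** 2).sum(axis=1))
    return d.argmin(axis=1)

def kmeans(sample, k, iters=10, seed=0):
    """Phase 1: plain Lloyd's KMeans on a sample; returns (k, dim) centroids."""
    rng = np.random.default_rng(seed)
    centroids = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_centroid(sample, centroids)
        for c in range(k):
            members = sample[labels == c]
            if len(members):  # leave a centroid unchanged if its cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(vectors, centroids):
    """Phase 2: assign every vector to its nearest centroid's inverted list."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, c in enumerate(nearest_centroid(vectors, centroids)):
        lists[int(c)].append(vid)
    return lists

def insert(lists, centroids, vec, vid):
    """Update: one nearest-centroid lookup plus an append; centroids stay fixed."""
    c = int(nearest_centroid(vec[None, :], centroids)[0])
    lists[c].append(vid)
    return c

# Toy data: 2000 random 16-dim vectors, centroids trained on a 500-vector sample.
rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 16)).astype(np.float32)
sample = data[rng.choice(len(data), size=500, replace=False)]

centroids = kmeans(sample, k=8)
lists = build_ivf(data, centroids)

new_vec = rng.standard_normal(16).astype(np.float32)
cluster = insert(lists, centroids, new_vec, vid=len(data))
```

Note that `insert` never touches the centroids, which is why the update cost is one distance computation per centroid plus a list append, with no graph edges to repair as in HNSW.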