Hacker News new | ask | show | jobs
by HammadB 747 days ago
"The obstacle is that until now, off-the-shelf vector databases could not index a dataset larger than memory, because both the full-resolution vectors and the index (edge list) needed to be kept in memory during index construction. Larger datasets could be split into segments, but this means that at query time they need to search each segment separately, then combine the results, turning an O(log N) search per segment into O(N) overall."

How is a log N search over S segments O(N)?

2 comments

I was trying to make the point that the dominant factor becomes linear instead of logarithmic, but more accurately it's O(S log N) = O(N log N) because S is proportional to N.
Sure, I see. I think this is an area where complexity analysis doesn’t lead to useful information.

To be more correct it’s O(N/C log C) where C is the capacity of a segment. In this case you can ignore 1/C and log C as constant. So now sure, you actually just have O(N). But this is not super useful as it says that a segmented hnsw approach and brute force approach are the same - when this is really not the case in practice.

Also O(N log N) > O(N) so I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.

> I’m not sure why we would ever do anything with segmentation according to that analysis if it were correct.

What's your alternative when you can't build an index larger than C?

If segmented hsnw indices were O(N log N) - it would make no sense to build the index at all - brute force would be better as O(N log N) > O(N)
Doesn't doubling N double S?