Hacker News new | ask | show | jobs
by marginalia_nu 1001 days ago
Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general hash map table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.

1 comments

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.
Point is, with a skip list (or similar), the lists don't need to be small. You can intersect data sets that are enormous very quickly using this algo[1] where a single linear read of both lists is the worst case scenario.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...