Hacker News new | ask | show | jobs
by davidebaldini 1001 days ago
If I understand, having only 4096 bytes of data per term causes multiple terms in the same query to intersect to little or no results. The purpose seems to cut cost in compromise of completeness.
2 comments

Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general hash map table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.
Point is, with a skip list (or similar), the lists don't need to be small. You can intersect data sets that are enormous very quickly using this algo[1] where a single linear read of both lists is the worst case scenario.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...

Yes, you're correct on the purpose. We mitigate it a little by also indexing on bigrams.