| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by marginalia_nu 1001 days ago
	Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references. In general hash map table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.

1 comments

daoudc 1001 days ago

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.

link

marginalia_nu 1001 days ago

Point is, with a skip list (or similar), the lists don't need to be small. You can intersect data sets that are enormous very quickly using this algo[1] where a single linear read of both lists is the worst case scenario.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...

link