| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by davidebaldini 1001 days ago
	If I understand, having only 4096 bytes of data per term causes multiple terms in the same query to intersect to little or no results. The purpose seems to cut cost in compromise of completeness.

2 comments

marginalia_nu 1001 days ago

Yeah. That seems like a design decision that will scale poorly. For reference, even in my dinky 100M index I have individual terms with several gigabytes of associated document references.

In general hash map table index designs don't tend to be very efficient. If you use a skip list or something similar, you can calculate the intersection between sets in sublinear time.

link

daoudc 1001 days ago

We actually just take the union and then re-rank. Because the lists are all small, this is cheap.

link

marginalia_nu 1001 days ago

Point is, with a skip list (or similar), the lists don't need to be small. You can intersect data sets that are enormous very quickly using this algo[1] where a single linear read of both lists is the worst case scenario.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...

link

daoudc 1001 days ago

Yes, you're correct on the purpose. We mitigate it a little by also indexing on bigrams.

link