| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by o11c 950 days ago

Note that the real primitive is "find nearest [with hint]", which has to be the #1 thing I miss in Python.

For B-trees it's only going to be a significant win if the chance of intersection is small, less than about `1/(nodes_per_block)`. For binary trees it's a much bigger win since that becomes 1/2 and binary trees are horrible on cache.

Hmm, can you efficiently intersect an arbitrary number of B-trees using the same idea as a heap-merge, but with a max heap instead of a min heap? You'd still have to iterate over all on a match, but as long as 2 input trees don't have an intersection, you don't have to look at the others at all ... or does that become equivalent to just doing them in series?

1 comments

marginalia_nu 950 days ago

A nuance that is important here is that not all accesses are equal within the context of disk reads. B-trees are designed to minimize block reads, not memory operations.

I guess there are worst case scenarios with evenly spaced intersections spread out exactly one per block, but in terms of block reads it fundamentally doesn't matter how you intersect two such trees, you'll still have to read all blocks, and that is orders of magnitude slower than comparing the values within.

I think the tree structure can be considered cached in real world scenarios; not really relevant to the performance you'll get.