Hacker News new | ask | show | jobs
by bunsenmcdubbs 2072 days ago
And great question! The hashing heuristic is a bit of a cute trick and I definitely glossed over it a bit in the post. I'll try to elaborate a bit more here...

> most hashes ... are essentially one way functions, ideally preserving no information about the input at all

You are correct, an individual hash does not preserve any information about the node, nor does it approximate rank in the union-find graph. We only get the effect in aggregate when many unions are performed and we repeatedly hash and select the lower value as the new root.

The hash for a given node in the graph is deterministic and fixed. It is effectively a random value. But the hash for a tree of nodes (aka the hash of the tree's root node) decreases with each union operation. This is because we select the root with the lower hash to become the root of the combined tree every union operation. As a result, the node with the lowest hash is the root of each tree. With more union operations, larger trees will tend to have lower and lower hashes - effectively approximating a rank.

Put another (more handwave-y) way, we essentially roll the dice and get a fixed random number (hash) for each node in the tree. Since the tree is assigned the lowest random number contained within, larger trees will probably have lower values than smaller trees (more dices rolls to get a lower number).

3 comments

Ahhh I see now. I was so caught up trying to think of ways to make the hash function give you this property, I didn't realize that it didn't matter and that it emerged as a result of you picking which hash to assign to a tree. Makes sense now, thanks.
I think a simplified explanation is that, assuming your hash function is ideal (can be assumed to be random, except for being deterministic), you end up building a treap, minus the treap's restriction on the in-order traversal of keys and restriction on the number of children. The height of a treap is O(log N), so your solution is expected to be within a constant factor of an optimal solution.

[0] https://en.wikipedia.org/wiki/Treap

I think there's a paper somewhere claiming that completely random linking is just as good as union-by-rank (in expectation).
Yes, there's a few papers over the last few years on this, starting with this paper: https://www.cis.upenn.edu/~sanjeev/papers/soda14_disjoint_se...

Here is a more recent paper (from this year) analyzing concurrent union find with random linking: https://arxiv.org/pdf/2003.01203.pdf

Woah these look great! Sad I didn't encounter them earlier.

Our team is actively working on improvements in this part of our data infrastructure (hence this project & blog post) so maybe there will be a follow up coming up with the next version of our Identity implementation...

Isn't that expected (within a constant factor of optimal depth, a.k.a. O(log N) tree height)? They're essentially building a treap, minus the in-order traversal of keys restriction.

[0] https://en.wikipedia.org/wiki/Treap

Optimal depth for union-find is O(1); every node points directly to the root. It's a very different case than building a binary search tree.