Hacker News new | ask | show | jobs
by kraghen 3219 days ago
I'm happy to see this excellent paper mentioned. Fritz Henglein (the author) was my thesis supervisor last year, and I worked on developing some of his ideas further.

In particular, I generalised discrimination (and added a touch of linear algebra) to devise a simple multi-way join algorithm that computes the join of any number of relations in optimal time (in a specific technical sense). Such results have been obtained recently, but only with far more complicated algorithms.

Alas, the fully general proof of optimality eludes us, so nothing has been published yet.

2 comments

Is there anything written about your multi-way join algorithm? We've done some really interesting implementations of Generic Join and the like and are always interested in hearing about other potentially better join algorithms.
It's only described in my thesis, which is not yet publicly available. (It should be, but the university thesis publication process seems to have a latency measured in years...)

The case for triangle joins is simple to describe, though. Suppose we have attributes A, B and C and finite maps (relations) R : A x B -> X, S : A x C -> Y and T : B x C -> Z. Now build tries for each so we have

abx : Trie A (Trie B X)

acy : Trie A (Trie C Y)

bcz : Trie B (Trie C Z)

We can join two tries using join : Trie k v1 -> Trie k v2 -> (v1 -> v2 -> v) -> Trie k v. Putting this all together we have:

triangleJoin : Trie A (Trie B (Trie C (X, Y, Z)))

triangleJoin = join abx acy (\bx cy -> join bx bcz (\x cz -> join cy cz (\y z -> (x, y, z))))

Compared to a database that doesn't know how to handle triangle joins the difference is absolutely stunning. The above one-liner managed to do in 5 seconds what took MySQL over an hour...

Is there any way I could get a preprint of your thesis?

I've been working on a project that sits somewhere between spark and riak and this actually might address a core issue I have both at the node and query dispatch level.

I can only do a humble and poor job of explaining this process by analogy to other algorithms. If you've got a better explanation I could rote memorize, I'd be very appreciative.
I can't say I have had much success explaining my thesis clearly to anyone except people from the same department, but I can try to give my understanding of discrimination from an implementor's perspective.

The succinct version is that all (discrete) data can be serialised as bit strings and those bit strings can be sorted in linear time by (say) radix sort.

There are two ideas at work here: that all (discrete) data structures ultimately are composed of primitive types that can be sorted in linear time, and that we should only examine each primitive value once.

However, to be fair it should be mentioned that the analysis uses a different machine model (the RAM model) than is usually employed in sorting (where all comparisons are considered to take constant time, which is arguably an inferior model of real computers if you are sorting e.g. strings of variable length).

To be honest, though, the original paper is so well-written that I have a hard time explaining it better.