Hacker News new | ask | show | jobs
by chrismcbride 5128 days ago
I might be reading this wrong, but from the slides it seems like JOINs are only fast when they are on the same machine. Their own ORM library avoids using JOINs all together. Doesn't that defeat the point of using an RDBMS?
2 comments

If you look at the example tables, what they're doing is sharding based on customerID, so sections of those tables with the same customerID "stick" together and are guaranteed to be stored together, on the same machine(s) (plural because of replication). Their workload is such that the workload is dominated by JOINs that do not mix customerIDs, hence the JOIN is run on a single machine, inside the database process, which is fast. They do support JOINs across customerIDs, but it probably involves the client (or a proxy process) fetching rows and merging/sorting them, ie. it happens outside the server process, probably on another machine, and is much slower.

The same trade-off also happens with transactions: if all the WRITEs are for the same customerID, it will be fast with almost no performance hit compared to issuing the same commands without transactions. They also support transactions across customerIDs, which involves an expensive 2-phase commit (2PC) round across the machines (across the Paxos quorums, really), which will be much slower.

Disclaimer: I wrote a distributed database with similar design decisions.

is there something similar for any opensource rdbms, shard all tables by the same customer_id?
VoltDB makes some similar tradeoffs that allow single sharded SQL to continue to work as expected.

There is actually quite a list of things that are trying to make this work. DBShards, Clustrix, Xeround, and I am sure that isn't all of them.

Yeah, that's what I thought too. I wasn't sure how much the devs could do from a brief look at the PDF, and it felt like they got vague at that point.