| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Maro 5175 days ago

If you look at the example tables, what they're doing is sharding based on customerID, so sections of those tables with the same customerID "stick" together and are guaranteed to be stored together, on the same machine(s) (plural because of replication). Their workload is such that the workload is dominated by JOINs that do not mix customerIDs, hence the JOIN is run on a single machine, inside the database process, which is fast. They do support JOINs across customerIDs, but it probably involves the client (or a proxy process) fetching rows and merging/sorting them, ie. it happens outside the server process, probably on another machine, and is much slower.

The same trade-off also happens with transactions: if all the WRITEs are for the same customerID, it will be fast with almost no performance hit compared to issuing the same commands without transactions. They also support transactions across customerIDs, which involves an expensive 2-phase commit (2PC) round across the machines (across the Paxos quorums, really), which will be much slower.

Disclaimer: I wrote a distributed database with similar design decisions.

1 comments

ddorian43 5175 days ago

is there something similar for any opensource rdbms, shard all tables by the same customer_id?

link

arielweisberg 5175 days ago

VoltDB makes some similar tradeoffs that allow single sharded SQL to continue to work as expected.

There is actually quite a list of things that are trying to make this work. DBShards, Clustrix, Xeround, and I am sure that isn't all of them.

link