Hacker News new | ask | show | jobs
by josephg 1745 days ago
> Like real distribution, when you add more servers writes distribute as well?

There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

I don't know how well that works for programming, but with databases I think its a great idea. CPUs are obnoxiously fast. Its IO and memory which are slow. I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other using strict serialization counters (in registers / L1 cache). If the stream of writes coming in (and going out) happened over DPDK or something, you could probably get performance on the order of 10m-100m counter updates per second.

Then a whole cluster of computers sit around that core. On one side you have computers feeding it with writes. (And handling the retry logic if the write was speculatively misordered.) And on the other side you have computers taking the firehose of writes, doing fan-out (kafka style), updating indexes, saving everything durably to disk, and so on.

If that would work, you would get full serializable database ordering with crazy fast speeds.

You would hit a hard limit based on the speed of the fastest CPU you can buy, but I can't think of much software on the planet which needs to handle writes at a rate faster than 100m per second. And doesn't have some natural sharding keys anyway. Facebook's analytics engine and CERN are the only two which come to mind.

3 comments

> I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other

Calvin is an interesting alternate design that puts "reach global consensus on transaction order" as its first priority, and derives pretty much everything else from that. Don't even need to bottleneck through a single CPU.

http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.p...

Like VoltDB, the one huge trade-off is that there's no `BEGIN ... COMMIT` interaction where the client gets to do arbitrary things in the middle. Which would be fine, if programming business logic in the database was saner than with Postgres.

What you suggest would work if you just want to order writes. However, it won't support an ACID transaction model, which a lot of applications expect, or even simpler models where you can read values before you write.

For instance: consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100. You can execute this on the primary (resource bottleneck because you need to see current state and only the primary has the full picture) or do the operation on a local node and try to submit it to the primary for ordering (coordination required over the network, e.g., to ensure data are current).

There are other approaches but they generally result in the same kind of conflicts and consequent slow-downs. Or they limit the operations you can handle, for example by only allowing conflict-free operations. That's what CRDTs do. [1]

[1] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf

> However, it won't support an ACID transaction model

I think you could add ACID support, while the process doing ordering still not caring about the data. You do something like this:

- Split the keyset into N buckets. Each bucket has an incrementing version number. (The first change is 1, then 2, then 3, and so on). The whole system has a vector clock with a known size (eg [3, 5, 1, 12, etc] with one version per bucket.)

- Each change specifies a validity vector clock - eg "this operation is valid if bucket 3 has version 100 and bucket 20 has version 330". This "validity clock" is configured on a replica, which is actually looking at the data itself. The change also specifies which buckets are updated if the txn is accepted.

- The primary machine only compares bucket IDs. Its job is just to receive validity vector clocks and make the decision of whether the corresponding write is accepted or rejected. If accepted, the set of "write buckets" have their versions incremented.

- If the change is rejected, the secondary waits to get the more recent bucket changes (something caused it to be rejected) and either retries the txn (if the keys actually didn't conflict) or fails the transaction back to the end user application.

So in your example:

> consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100

So the write says "read key X, write key X". Key X is in bucket 12. The replica looks at its known bucket version and says "RW bucket 12 if bucket 12 has version 100". This change is sent to the primary, which compares bucket 12's version. In our case it rejects the txn because another replica had a concurrent write to another key in bucket 12. The replica receives the rejection message, checks if the concurrent change conflicts (it doesn't), then retries saying "RW bucket 12 if bucket 12 has version 101". This time the primary accepts the change, bumps its local counter and announces the change to all replicas (via fan-out).

The primary is just doing compare-and-set on a small known array of integers which fit in L1 cache, so it would be obscenely fast. The trick would be designing the rest of the system to keep replica retries down. And managing to merge the firehose of changes - but because atomicity is guaranteed you could shard pretty easily. And there's lots of ways to improve that anyway - like coalescing txns together on replicas, and so on.

I'm really sorry didn't see your comment earlier. I like your solution but am a little stuck on the limitations.

1.) It requires applications to specify the entire transaction in advance. ACID transactions allow you to begin a transaction, poke around, change something, and commit. You can derive the transaction by running it on a replica, then submitting to the primary, in which case this becomes an implementation of optimistic locking including transaction retries.

2.) As you pointed out the trick is to keep the array small. However, this only works if you have few conflicts. Jim Grey, Pat Helland, and friends pointed this out in 1996. [1] That in turn seems to imply a very large number of buckets for any non-trivial system, which seems like a contradiction to the conditions for high performance. In the limit you would have an ID for every key in the DBMS.

3.) Finally, what about failures? You'll still need a distributed log and leader election in case your fast machine dies. This implies coordination, once again slowing things down to the speed of establishing consensus on the network to commit log records.

Incidentally Galera uses a similar algorithm to what you propose. [2] There are definitely applications where this works.

[1] https://dsf.berkeley.edu/cs286/papers/dangers-sigmod1996.pdf

[2] https://galeracluster.com/library/documentation/certificatio...

>There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

Sounds like mob programming!