Hacker News new | ask | show | jobs
by mirekrusin 1748 days ago
It always rubs me the wrong way - all those paxos/raft approaches (which are great, but...) simply elect the leader to pick writes. In that sense there is no distribution of computation at all. It's still single target that has to cruch through updates. Replication is just for reads. Are we going to have something better anytime soon? Like real distribution, when you add more servers writes distribute as well?
6 comments

> Like real distribution, when you add more servers writes distribute as well?

There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

I don't know how well that works for programming, but with databases I think its a great idea. CPUs are obnoxiously fast. Its IO and memory which are slow. I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other using strict serialization counters (in registers / L1 cache). If the stream of writes coming in (and going out) happened over DPDK or something, you could probably get performance on the order of 10m-100m counter updates per second.

Then a whole cluster of computers sit around that core. On one side you have computers feeding it with writes. (And handling the retry logic if the write was speculatively misordered.) And on the other side you have computers taking the firehose of writes, doing fan-out (kafka style), updating indexes, saving everything durably to disk, and so on.

If that would work, you would get full serializable database ordering with crazy fast speeds.

You would hit a hard limit based on the speed of the fastest CPU you can buy, but I can't think of much software on the planet which needs to handle writes at a rate faster than 100m per second. And doesn't have some natural sharding keys anyway. Facebook's analytics engine and CERN are the only two which come to mind.

> I could imagine making a single CPU core's job to be simply ordering all incoming writes relative to each other

Calvin is an interesting alternate design that puts "reach global consensus on transaction order" as its first priority, and derives pretty much everything else from that. Don't even need to bottleneck through a single CPU.

http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.p...

Like VoltDB, the one huge trade-off is that there's no `BEGIN ... COMMIT` interaction where the client gets to do arbitrary things in the middle. Which would be fine, if programming business logic in the database was saner than with Postgres.

What you suggest would work if you just want to order writes. However, it won't support an ACID transaction model, which a lot of applications expect, or even simpler models where you can read values before you write.

For instance: consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100. You can execute this on the primary (resource bottleneck because you need to see current state and only the primary has the full picture) or do the operation on a local node and try to submit it to the primary for ordering (coordination required over the network, e.g., to ensure data are current).

There are other approaches but they generally result in the same kind of conflicts and consequent slow-downs. Or they limit the operations you can handle, for example by only allowing conflict-free operations. That's what CRDTs do. [1]

[1] https://hal.inria.fr/hal-00932836/file/CRDTs_SSS-2011.pdf

> However, it won't support an ACID transaction model

I think you could add ACID support, while the process doing ordering still not caring about the data. You do something like this:

- Split the keyset into N buckets. Each bucket has an incrementing version number. (The first change is 1, then 2, then 3, and so on). The whole system has a vector clock with a known size (eg [3, 5, 1, 12, etc] with one version per bucket.)

- Each change specifies a validity vector clock - eg "this operation is valid if bucket 3 has version 100 and bucket 20 has version 330". This "validity clock" is configured on a replica, which is actually looking at the data itself. The change also specifies which buckets are updated if the txn is accepted.

- The primary machine only compares bucket IDs. Its job is just to receive validity vector clocks and make the decision of whether the corresponding write is accepted or rejected. If accepted, the set of "write buckets" have their versions incremented.

- If the change is rejected, the secondary waits to get the more recent bucket changes (something caused it to be rejected) and either retries the txn (if the keys actually didn't conflict) or fails the transaction back to the end user application.

So in your example:

> consider an application that looks at the current value of a counter and adds 1 if the counter is less than 100

So the write says "read key X, write key X". Key X is in bucket 12. The replica looks at its known bucket version and says "RW bucket 12 if bucket 12 has version 100". This change is sent to the primary, which compares bucket 12's version. In our case it rejects the txn because another replica had a concurrent write to another key in bucket 12. The replica receives the rejection message, checks if the concurrent change conflicts (it doesn't), then retries saying "RW bucket 12 if bucket 12 has version 101". This time the primary accepts the change, bumps its local counter and announces the change to all replicas (via fan-out).

The primary is just doing compare-and-set on a small known array of integers which fit in L1 cache, so it would be obscenely fast. The trick would be designing the rest of the system to keep replica retries down. And managing to merge the firehose of changes - but because atomicity is guaranteed you could shard pretty easily. And there's lots of ways to improve that anyway - like coalescing txns together on replicas, and so on.

I'm really sorry didn't see your comment earlier. I like your solution but am a little stuck on the limitations.

1.) It requires applications to specify the entire transaction in advance. ACID transactions allow you to begin a transaction, poke around, change something, and commit. You can derive the transaction by running it on a replica, then submitting to the primary, in which case this becomes an implementation of optimistic locking including transaction retries.

2.) As you pointed out the trick is to keep the array small. However, this only works if you have few conflicts. Jim Grey, Pat Helland, and friends pointed this out in 1996. [1] That in turn seems to imply a very large number of buckets for any non-trivial system, which seems like a contradiction to the conditions for high performance. In the limit you would have an ID for every key in the DBMS.

3.) Finally, what about failures? You'll still need a distributed log and leader election in case your fast machine dies. This implies coordination, once again slowing things down to the speed of establishing consensus on the network to commit log records.

Incidentally Galera uses a similar algorithm to what you propose. [2] There are definitely applications where this works.

[1] https://dsf.berkeley.edu/cs286/papers/dangers-sigmod1996.pdf

[2] https://galeracluster.com/library/documentation/certificatio...

>There's a really interesting suggestion in The Mythical Man Month. He suggests that instead of hiring more programmers to work in parallel, maybe we should scale teams by keeping one person writing all the code but have a whole team supporting them.

Sounds like mob programming!

There are two ways I see databases doing paxos. The basic way, like some databases like Percona, basically is a single raft across the whole database. It can help with high availability, but the database scale is still kind of constrained to the capability of a single writer.

What you really want is databases like CRDB/Yugabyte/TiDB, which are sharding+raft. Tables are sharded into 128MB chunks, and each chunk has their own raft. The database handles transactions, distributed queries, and auto-balancing transparently.

Definitely, YugabyteDB falls into the second category. Tables are split into so called tablets. This splitting can be controlled by choosing the hashing algorithm for the primary key (or transparent row key hash, if no primary key for a table exists). Each tablet is replicated and has its own raft group. Different tables can have different replication factors, the number of tablets to split the table into can be selected and modified.

YugabyteDB recently added support for table spaces on steroids where table spaces are allocated to physical nodes in the cluster. This enables geo-replication features where tables or rows can be placed within selected geo locations.

All the data shifting is done transparently by the database.

Doesn't CockroachDB work this way as well, with each "range" running a separate Raft? That is what I get from https://www.cockroachlabs.com/docs/stable/architecture/overv...
All those noSQL or newSQL have more than one leader per table they have a distinct leader for each partition.

So if you have 6 server and 2 partition you could have 3 servers for partition number #1 and 3 different servers for partition number #2.

If you want extra performance you could make the server simply store key->value mapping using quorum write like Cassandra is doing but to keep data consistency you still have 2 choice

#1 use Optimistic concurrency (an app performing an update will verify if the data has changed since the app last read that data).

#2 using some kind of Lease, elect one machine to be the only one allowed to write to that partition for some time period.

Option #1 do not give faster transaction throughput but could offer lower tail latency.

Option #2 bring you back to square one of having a leader so you better just use (Paxos/Raf)

You are asking for a miracle or just simply - breach of the physics laws. The only way to horizontally scale write speed is to shard your data to be written somehow. You can easily do it but then your reading queries have to go to multiple servers to assemble single result. You can't scale both at the same time. There are some in between solutions like eventual read consistency that are relying on traffic at some point easing enough so that the conductor can actually synchronize servers. But if you have a steady stream of read/write requests the only way you really can scale is to tell amazon to sod off, buy yourself big fat multicore / multi CPU server with nice SSD array (preferably of Optane type) and watch your processing power suddenly shoot through the roof while the cost greatly decreases ;)
> There are some in between solutions like eventual read consistency that are relying on traffic at some point easing enough so that the conductor can actually synchronize servers.

You can use CRDT's to give a formally correct semantics to these "inconsistent" scenarios. And they might well be something that's best explored in a not-purely-relational model, since the way they work is pretty unique and hard to square with ordinary relational db's.

CRDT is just a fancy term for the strategy that still has to eventually merge data. In case of CRDT the data organized / designed in a way that makes it easier. The keyword here is "merging" which by definition kills "infinite" scalability.

You can dance around all you want but you just can't beat laws of nature.

Sure you can always design something that works for your particular case but generic solution is not possible.

Can't you arrange merging into ie. binary tree - in that setup you'd be collapsing merges into single one at the root and cummulative throughput at leaf nodes could be exponentially higher?
data is also sharded in those databases
true, thanks for pointing out, but this is a bit cheating isn't it? ie. atomic transactions cross shards flip over, right? it's basically ergonomic equivalent of just using multiple databases?
Not 100% sure what you mean by flipping over (fail?) but at least in CRDB you can of course do cross shard transactions. They wont be as fast as a transaction that can be completely handled by a single leader but it works fine and is transparent to the client.
You move over into the world of distributed transactions, which can be really expensive.

Thankfully sharding works great for a large number of applications (or in other cases you can accept eventual consistency).

Cross-shard transactions will use some sort of two-phase commit protocol internally. They still work just fine (the extra complexity is transparent to the client), but with somewhat worse performance. Most of the difficulty/expertise involved in using such systems optimally is in designing your schema and access patterns such that you minimize the number of cross-shard transactions.
CAP theorem. Choose two, it's a fundamental limitation of scaling a database.
CAP theorem. You can't have guarantee for having all three all the time. But you can still have nice thing most of the time.

Even better, sometimes you can change which guarantee you need.

We can do better then "pick two"

yes, but no. because in reality what we really want is not true pure AP or CP (or CA, which you can't have anyway)

https://www.youtube.com/watch?v=hUd_9FENShA

(CAP is a very important result about a very strong assumption of consistency: linearizable events, but for that you can't lose any messages [if I remember it correctly], otherwise the system will become inconsistent)

Thanks for the video, very interesting.
I've always found CAP theorem to be better described as:

Given the need for Partitions, you must choose between prioritizing Consistency or Availability.

Google Spanner does the three.
> The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

https://cloud.google.com/blog/products/databases/inside-clou...