| HN Mirror


by jedberg 4091 days ago

I don't agree with his fundamental premise:

> Network Partitions are Rare, Server Failures are Not

Network partitions happen all the time. Sure, the whole "a switch failed and that piece of the network isn't there anymore" doesn't happen a lot, but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.

5 comments

lobster_johnson 4091 days ago

Indeed. We have some external servers in a partner's datacenter which run under VMware's vMotion. Every now and then vMotion will shuffle a VM to another physical server, causing the entire OS to freeze for several (often 20-30) seconds, and everything that is partition-sensitive, like RabbitMQ and Elasticsearch, throws a tantrum and keels over.

Even VMs on more statically allocated clouds like DigitalOcean and AWS will experience small, constant blips that affect your whole stack.

What annoys me in particular is that these blips affect everything. Every app needs to fail gracefully, be it a PostgreSQL client connection, a Memcached lookup or an S3 API call. The fact that such catch-and-retry boilerplate logic needs to be built into the application layer, and every layer within it, is still something I find rather insane. It leaks into the application logic in often rather insidious ways, or in ways that pollute your code with defenses. Everything has to be idempotent, which is easy enough for transactional database stuff, less easy for things like asynchronous queues that fire off emails. Erlang has already provided a solution to the problem, but I suspect we need OS-level support to avoid reinventing the wheel in every language and platform. /rant
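To make the catch-and-retry and idempotency point concrete, here is a minimal Python sketch. Everything in it is hypothetical: `retry_with_backoff`, `IdempotentSender`, and the in-memory key set are illustrative names, not any library's API, and a real deduplication store would need to be durable.

```python
import time
from typing import Callable, Set, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


class IdempotentSender:
    """Drops repeat sends that carry an idempotency key already delivered."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._deliver = deliver
        # Assumption: in-memory only. A real system would persist these keys.
        self._done: Set[str] = set()

    def send(self, key: str, message: str) -> bool:
        if key in self._done:
            return False  # already delivered, so a retry becomes a no-op
        self._deliver(message)
        self._done.add(key)  # recorded only after deliver() returned
        return True
```

The retry wrapper is only safe around operations that tolerate being run more than once, which is why the email case needs something like the key check: without it, a retry after a lost acknowledgement sends the message twice.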

jchrisa 4091 days ago

Our customers tend to be the kind who need extreme performance, so they aren't spanning clusters across WANs. For well-tuned datacenters, rack awareness (putting the replicas in sane places) is more useful.

For WAN replication we have cross-datacenter replication, which works on an AP model.

jedberg 4091 days ago

To be clear, I was in no way making a judgement about CouchDB vs. Cassandra. I've only given Couch a cursory glance so I wouldn't be qualified to make such a judgement.

I was simply trying to point out that while you may have a very good argument as to why Couch is better, the network partition argument is not sound, and you may want to look for a better argument to make.

I'm personally against single masters because they are SPOFs. With a master, at some point there needs to be a single arbiter of truth, and if that is unavailable, then the system is unavailable.

strmpnk 4091 days ago

A nitpick, but an important one that I wish the Couchbase folks wouldn't let slip as often as they do: CouchDB has very different properties from Couchbase, and the two should be considered entirely different database designs, regardless of the availability of a sync gateway for replication with a number of JSON stores.

jchrisa 4091 days ago

Different database designs, but similar document model. In fact, Couchbase Sync Gateway is capable of syncing between Couchbase Server and Apache CouchDB. Also our iOS, Android, and .NET libraries can sync with CouchDB and PouchDB. Everything open source, of course. More info: http://developer.couchbase.com/mobile/

strmpnk 4091 days ago

It's not that it can sync, nor the data model.

It's that these are fundamentally very different databases with different trade offs. You can't just take one and adjust some API calls and expect things to work in a similar way. It only confuses people when it's quietly ignored and others assume that since it wasn't pointed out to be wrong that it must be the same thing.

I've had far too many conversations with people who use Couchbase and can't tell the difference for me to say that it's just general confusion. It's lax work on Couchbase's part and a thorn in the Apache CouchDB project that there is no effort to help clarify the fact that they are indeed independent and now very different databases.

jchrisa 4091 days ago

Exactly. Couchbase trades some availability during rebalance for more lightweight client interactions. As Damien argues in his post, this allows it to meet the same SLA with less hardware.

bbromhead 4091 days ago

Cassandra has rack awareness...

leef 4091 days ago

Really it doesn't matter and your point only goes to show one of Cassandra's flaws. When either a network partition or a server failure happens Cassandra starts reshuffling data amongst multiple hosts and filling network pipes. Contrast this with a setup where you have a static partitioning of hosts to partition and a leader per partition. Then you only need to (possibly) elect a new leader and carry on.

This is especially relevant when you need to do these things because of unexpected load increases or the loss of hosts in your cluster.
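A rough Python sketch of the static-partitioning idea described above, where a failure triggers at most a leader change and no data movement. The `elect_leaders` function, the host names, and the "first live replica wins" rule are assumptions made for illustration, not how any particular database elects leaders.

```python
from typing import Dict, List, Optional, Set


def elect_leaders(
    assignment: Dict[str, List[str]], down: Set[str]
) -> Dict[str, Optional[str]]:
    """Pick a leader per partition from a fixed host assignment.

    assignment maps partition id -> its replica hosts in preference order.
    The assignment never changes here; only the choice of leader does.
    A partition with no live replica gets None (it is unavailable).
    """
    return {
        partition: next((h for h in hosts if h not in down), None)
        for partition, hosts in assignment.items()
    }


# Hypothetical layout: two partitions, three replicas each.
ASSIGNMENT = {"p0": ["a", "b", "c"], "p1": ["b", "c", "a"]}
```

With `ASSIGNMENT` above, losing host `a` moves leadership of `p0` to `b` and leaves `p1` untouched.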

jedberg 4091 days ago

That's incorrect. Data doesn't shuffle on a partition unless you do it manually.

leef 4091 days ago

I realize your previous comment was more in reference to transient network partitions so my comment is out of place. But whether the mechanism is manual or automatic once a network partition is discovered the reshuffle begins.

_benedict 4090 days ago

It would be very uncommon to perform a token rebalance (or bring up replacement nodes) under a network partition, since those nodes are fine. The idea is to be tolerant to a network partition, which multi-master is, not to then attempt to patch up a whole new network while the partition is in place. That could easily bring down the entire cluster.

Data is only typically shuffled around when a replacement node is introduced to the cluster.

If you have multiple independent network partitions that isolate all of your RF nodes, then there is no database that could function safely in this scenario, and this has nothing to do with data shuffling.

jedberg 4091 days ago

While this is true, I'm not sure why that's a problem. The system can still function while the data is in transit from one node to another. As long as the right 2/3 of the machines are up, the cluster can function at 100%, and as long as the right 1/3 are up, it can still serve reads (assuming a quorum of 3).

YZF 4091 days ago

You mean replication factor of 3? It doesn't work like that. If your replication factor is 3 and you're using quorum read/writes then as soon as two machines are down some of the reads and writes will fail. The more machines down the higher the probability of failure. That's why you have to start shuffling data around to maintain your availability which is a problem... (EDIT: assuming virtual nodes are used, otherwise it's a little different)
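The quorum arithmetic can be checked with a small Python model. This is a sketch under a simplifying assumption: a plain ring without virtual nodes, where token range `i` is stored on nodes `i`, `i+1`, ..., `i+rf-1`. The function names are made up for the example. With virtual nodes, more pairs of nodes share a range, so more two-node failures would hurt.

```python
from typing import Iterable, List


def quorum_size(rf: int) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def replicas(range_index: int, n_nodes: int, rf: int) -> List[int]:
    """Nodes holding a token range in the simplified ring (assumption)."""
    return [(range_index + k) % n_nodes for k in range(rf)]


def quorum_available_fraction(n_nodes: int, rf: int, down: Iterable[int]) -> float:
    """Fraction of token ranges that can still reach a quorum of live replicas."""
    down_set = set(down)
    need = quorum_size(rf)
    ok = sum(
        1
        for r in range(n_nodes)
        if sum(1 for node in replicas(r, n_nodes, rf) if node not in down_set) >= need
    )
    return ok / n_nodes
```

In a six-node ring with RF=3, one node down leaves every range with a quorum. Two adjacent nodes down breaks quorum for two of the six ranges, while two nodes on opposite sides of the ring break none in this model.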

jedberg 4090 days ago

Like I said, it depends on how you lay out your data. Let's say you have three data centers, and you lay out your data such that there is one copy in each datacenter (this is how Netflix does it for example).

You could then lose an entire datacenter (1/3 of the machines) and the cluster will just keep on running with no issues.

You could lose two datacenters (2/3s of the machines) and still serve reads as long as you're using READ ONE (which is what you should be doing most of the time).
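The datacenter scenario above reduces to counting live replicas against what each consistency level requires. A minimal Python sketch, assuming exactly one replica per datacenter and RF=3; `can_serve` and `survivable_levels` are illustrative helpers, and the level names mirror Cassandra's ONE, QUORUM, and ALL only by name.

```python
from typing import Callable, Dict, List

# Replica acknowledgements needed at each consistency level, as a function of RF.
REQUIRED_ACKS: Dict[str, Callable[[int], int]] = {
    "ONE": lambda rf: 1,
    "QUORUM": lambda rf: rf // 2 + 1,
    "ALL": lambda rf: rf,
}


def can_serve(live_replicas: int, rf: int, level: str) -> bool:
    """True if enough replicas are reachable for the given level."""
    return live_replicas >= REQUIRED_ACKS[level](rf)


def survivable_levels(datacenters_up: int, rf: int = 3) -> List[str]:
    """Levels still usable, assuming one replica per datacenter,
    so live replicas equals the number of reachable datacenters."""
    return [
        level
        for level in ("ONE", "QUORUM", "ALL")
        if can_serve(datacenters_up, rf, level)
    ]
```

With two of three datacenters up, ONE and QUORUM both still work. With only one up, ONE is the only level left, which matches the READ ONE case.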

_benedict 4090 days ago

If a machine is down (as in the machines themselves are dead) then you absolutely need to move the data onto another node to avoid losing it entirely. There is no way to avoid this, whatever your consistency model.

If there is a network partition, however, there is no need to move the data since it will eventually recover; moving it would likely make the situation worse. Cassandra never does this, and operators should never ask it to.

If you have severe enough network partitions to isolate all of your nodes from all of their peers, there is no database that can work safely, regardless of data shuffling or consistency model.

jermo 4090 days ago

We don't even need to agree or disagree. There are numerous studies on how often network partitions happen. The Aphyr & Bailis paper 'The Network Is Reliable' [1] has a very detailed overview of these studies.

[1] https://aphyr.com/posts/288-the-network-is-reliable

grahamux 4091 days ago

> but what does happen a lot is a slow or delayed connection, or a machine going offline for a few seconds.

This is especially true for cross-datacenter rings across the public internet.

eternalban 4091 days ago

That is potentially a partition. Anything that violates the SLA is an effective partition.
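One way to read this in code: a caller that enforces a deadline cannot tell a slow peer from an unreachable one. The Python sketch below is illustrative only; `call_within_sla` and `simulated_peer` are hypothetical names, and the deadline is an arbitrary stand-in for whatever the SLA specifies.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional, Tuple


def simulated_peer(delay_seconds: float) -> str:
    """Stand-in for a remote call that answers after delay_seconds."""
    time.sleep(delay_seconds)
    return "pong"


def call_within_sla(
    op: Callable[[], Any], sla_seconds: float
) -> Tuple[str, Optional[Any]]:
    """Run op() and treat a missed deadline the same as an unreachable peer.

    Returns ("ok", result) if op finishes in time, else ("partitioned", None).
    Exceptions raised by op itself propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(op)
    try:
        return ("ok", future.result(timeout=sla_seconds))
    except FutureTimeout:
        return ("partitioned", None)
    finally:
        # Do not block on the slow call; its worker thread finishes in the background.
        pool.shutdown(wait=False)
```

From the caller's side, `call_within_sla(lambda: simulated_peer(0.3), 0.05)` gives the same answer whether the peer is slow, frozen by a VM migration, or cut off by a failed switch.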

eternalban 4090 days ago

Seriously HN, that is fully on topic and did not deserve a down vote.