Hacker News new | ask | show | jobs
by teraflop 3988 days ago
Like a lot of designs that use Raft/Zookeeper/Paxos/whatever as a building block, the full system doesn't inherit all of the safety properties of the underlying consensus algorithm. I don't think that makes this code useless by any means, but I think it's important to be aware of the edge cases.

Consensus algorithms are popular because they're supposed to solve the difficult problem of guaranteeing consistency while attempting to provide liveness, in the presence of arbitrary node or connection failures. Etcd itself can provide this for operations on its own datastore, but that doesn't mean it can be used as a perfect failure detector for another system (which is impossible in the general case). In particular, if the database master becomes partitioned from the etcd leader for more than 30 seconds but is still accessible to clients, boom -- split brain.

(You can attempt to mitigate this with timeouts, but that's not foolproof if your system can experience clock skew or swapping/GC delays. Exactly this kind of faulty assumption has caused critical bugs in e.g. HBase in the past, turning what would otherwise be a temporary period of unavailability into data loss.)

EDIT: If I'm reading the code correctly, compose.io doesn't make any attempt to mitigate this failure scenario. If the Postgresql master can't contact etcd, it continues acting as a master indefinitely, even after 30 seconds have expired and another server might have taken over. This appears to be what happens in the "no action. not healthy enough to do anything." case in ha.py. I'd be happy to be corrected if there's something I'm missing.

1 comments

If the PostgreSQL leader doesn't reset the leader key, it's no longer leader.
The rest of the cluster doesn't think it's the leader, but the problem is that it still accepts database connections as if it were.

If a client sees a stale value of the leader key (which is possible, either through network hiccups or etcd's normal behavior of allowing reads from followers) then it could contact the old leader and perform updates which won't be visible on the new leader.

Those updates don't count though, since they weren't sent to the leader. Might as well pipe them to /dev/null. There is no need to track them.

A client shouldn't use a stale value. If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.

> Those updates don't count though, since they weren't sent to the leader.

But that's exactly the problem! If you were to run this whole system under a tool like Jepsen, this would show up as "acknowledged but lost writes". It's not generally considered acceptable to connect to your database, issue an UPDATE and a COMMIT, and have everything appear to work successfully, only for the data to disappear into the aether because it got sent to the wrong replica.

> If a DB does not hold a valid key, it shouldn't accept new connections, or signal that data was committed.

Exactly, and the problem with this implementation (again, unless I'm missing something) is that it may violate this contract.

Fencing isn't quite that simple, unfortunately.

I've been doing database, and specifically PostgreSQL, administration and HA setups for a long time now. This stuff is a lot harder than people think it is. People who roll their own solutions, thinking "Oh, this will totes be good enough!" tend to find themselves very painfully surprised that it isn't.

I've seen multiple deployments (not necessarily specific to PostgresSQL) engineer themselves into a corner with what people feel will be a highly available roll your own solution, complete with convincing sounding blog posts.

In every case, at some point, there were implementation/software bug related issues that ultimately caused more unplanned outages than I've ever seen a single, well run server experience.

Based on experience is there a common bug or scenario that you see overlooked often? Like say what happens during the transition between leaders, or handling multiple failures (multiple netsplits..)?
I can't really identify a common problem. Things I've seen include:

* After a complete, planned shutdown, neither server is happy to start until it sees the other one online. In the end, neither ends up booting. * A failover occurs, at which point you find out the hard way there is state being stored in a non-replicate file. I've seen this with several different Asterisk HA solutions in particular. * A failover occurs, and non-database aware storage snapshots leave the redundant server with a non-mountable mirror of the database.

I'm just wading into the HA waters with Postgres. I somewhat understand the tradeoffs between simplicity and robustness, but what would be your recommendation on how to proceed for someone who is a newbie?
This is going to sound cynical and self-serving (even though I'm not actually available for hire right now), but find someone who knows what they're doing and buy their time. It probably won't be cheap, but it will almost without doubt be cheaper than what you'll do to yourself if you try to hand-roll database HA.
Isn’t the major problem knowing that the expert actually know what they are doing and not just think they know what they are doing? Any tips on how you can separate the true experts from the deluded?
For someone who doesn't already have a background in and depth of understanding of this stuff, I'd first probably look for any relevant blog posts or articles written by the folks with whom you're considering a consulting arrangement. If they do exist, you can do some research on the things they're talking about and hopefully get at least a first-pass approximation of their full of shit factor.

Any consultant worth their day (let alone week) rate should be also able to refer you to previous clients, from whom you can hopefully get some sense of how satisfied people are with the candidate's work.

And, ultimately, there's an intuition factor at work here. It's been my consistent experience that if a candidate gives you some kind of hinky vibe, don't use them. When folks I've worked with haven't followed that, the results have pretty reliably been poor, at best.

I'll second this. Getting your data store right is way too important to leave to the inexperienced (and I say that being, if I had to judge, the semi-experienced; I've refused gigs centered around HA data stores because I won't put my name on something I am not 100% sure will work for them without issues).
I haven't look at the code, but the failover should ensure that the HAproxy isolates the failed master ("fencing" in HA terminology).
I won't pretend to understand all the details after only a quick glance at the code, but it looks to me like HAproxy is invoking a script[1] that checks each PostgreSQL instance directly (bypassing the governor process) and enables it as an endpoint iff the instance is a replication master, as opposed to a follower. In which case, if multiple instances are in master mode, the proxy might forward connections to either one. It has no way of knowing which one is considered to be the "real" one by etcd; and even if it tried to check, that check would itself be subject to race conditions.

[1]: https://github.com/compose/governor/blob/master/haproxy_stat...