Hacker News
by pilif 3988 days ago
Personally, I would go for a simpler solution. A failover is complicated in itself and happens at a point in time when stuff is already going wrong (there would be no failover otherwise). Do you really want all this additional infrastructure, with etcd and haproxy, as a dependency at that moment?

If you can live with a few minutes of downtime, I would recommend triggering your failover through human intervention once you have ascertained that failing over would actually help. You never, ever want to fail over because the master doesn't respond in time due to high load: at that point, failing over will only make things worse due to cold caches.

See https://github.com/blog/1261-github-availability-this-week for a nice story of automated DB failover going wrong.

In our case, we're running keepalived to share the IP address of the postgres master, but we don't actually automatically act on PG availability changes.

In a situation that actually warrants the failover, a human will kill the master node by shutting it down and keepalived will select another master and trigger the failover (which is then automated using `trigger_file` in `recovery.conf`).
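
As a rough sketch of how the keepalived side could hook into that: keepalived can call a notify script on VRRP state changes, passing the type, the instance name and the new state as arguments. The script below creates the trigger file when the node becomes MASTER. The path `/var/lib/postgresql/failover.trigger` and the function name `handle_transition` are made up for illustration; the path has to match whatever `trigger_file` is set to in your `recovery.conf`.

```python
#!/usr/bin/env python3
"""Hypothetical keepalived notify script: promote the local standby on MASTER."""
import sys
from pathlib import Path

# Assumed path; must match trigger_file in recovery.conf on the standby.
TRIGGER_FILE = Path("/var/lib/postgresql/failover.trigger")


def handle_transition(state: str, trigger_file: Path = TRIGGER_FILE) -> bool:
    """Create the trigger file if keepalived reports MASTER.

    Returns True if the trigger file was touched, False otherwise.
    """
    if state.upper() != "MASTER":
        # BACKUP and FAULT transitions need no promotion.
        return False
    trigger_file.touch(exist_ok=True)
    return True


# keepalived calls notify scripts as: <script> <TYPE> <NAME> <STATE> [<PRIORITY>]
if __name__ == "__main__" and len(sys.argv) >= 4:
    handle_transition(sys.argv[3])
```

Nothing here checks whether PG itself is healthy, which is the point: the only thing that moves the IP and touches the file is the old master going away.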

In this case we have only one additional piece of infrastructure (keepalived) and we can be sure that we don't accidentally make our lives miserable with automated failovers.

The cost is, of course, potential additional downtime while somebody checks the situation, does minimal emergency root cause analysis and then shuts down the failed master.

In the even rarer case of hardware failure, keepalived would of course fail over automatically, but let's be honest: most failures are caused by application or devops issues and in these cases it pays off to be diligent instead of panicking.

3 comments

VMS clusters on VAXen were doing fail-overs perfectly in the 80's. All kinds of products and software (even FOSS) do it today. You're telling me that, in 2015, you are doing manual failovers despite tons of free tools to automate it reliably?

Better to do an assessment of each thing that can fail, how to isolate/detect it, how to recover from it, how to implement that with available tools, and implement it. Test it in a number of situations on the same hardware, network, and apps you'll use in production. Once it's solid, put it into production. Then, never worry about that stuff again past monitoring and maintenance.

Btw, Netflix employs Monkeys to do this. Open-sources their tools with blog writeups on their use, too. I'm sure you Humans will be able to handle it. ;)

If you are running in Microsoft Azure you need two VM instances to get any form of availability SLA. Microsoft can reboot/migrate single instances whenever they feel like it. With manual failover you would only get away with a few minutes of downtime if someone is there to trigger it. That honestly sounds like a crappy solution in 2015.
What? In the case of Microsoft rebooting/migrating an instance and causing a failure, keepalived will fail over automatically.

The manual failover is in case of something going horribly wrong (outside of hardware failure), in which case a human steps in, looks at the situation, determines the best solution... and if it's failover, they initiate the failover.

I've personally used this procedure in the past and it worked 100% of the time there was a failure in a production environment. The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP.

In the PGSQL world, there are even a handful of tools to help you turn the old (failed) master into a slave, and correctly promote the old slave into a master; all in a single command on each side (which can be kicked off through keepalived).

And you can do this 24/7, since you are awake every day and night?
"The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP."

If there is a need for multiple 9's of uptime, there should be an escalation process for these kinds of events, which will probably include 24/7 on-call rotations.

Even if the problem is entirely self-resolving, it should still be looked at by more than one system. It should be noted, observed, documented, and confirmed it's truly resolved. That system is usually a human, but it doesn't necessarily have to be.

Can keepalived automatically float MAC addresses nowadays? Last time I checked, that didn't work and clients needed an ARP flush to use the new master.
Shouldn't sending a gratuitous ARP work to update clients?
keepalived uses VRRP and will issue a gratuitous ARP.
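
For anyone wondering what that announcement looks like on the wire, here is a minimal sketch that builds (but does not send) a gratuitous ARP request frame. The MAC and IP are placeholder values and `build_gratuitous_arp` is a name made up for this example; this is not keepalived's code, and some implementations fill the target hardware address differently.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    src_ip = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + src_mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), 6-byte MAC, 4-byte IP, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Gratuitous: sender IP and target IP are both the announced address.
    arp_body = src_mac + src_ip + b"\x00" * 6 + src_ip
    return ethernet + arp_header + arp_body


# Placeholder MAC (locally administered) and IP (documentation range).
frame = build_gratuitous_arp("02:00:00:aa:bb:cc", "192.0.2.10")
print(len(frame), frame.hex())
```

Because the frame goes to the broadcast address with the shared IP in both the sender and target fields, hosts on the segment that already have an entry for that IP can update it to the new MAC without anyone flushing caches by hand.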