| Personally, I would try to go for a simpler solution. A failover event is already complicated in itself and happens at a point in time where stuff is already going wrong (there would be no failover otherwise). Do you really want all this additional infrastructure with etcd and haproxy as a dependency at that moment? If you can live with a few minutes of downtime, I would recommend triggering your failover by human intervention once you have ascertained that the failover would actually help (you never, ever want to fail over if the master doesn't respond in time due to high load - at that point, failing over will only make things worse due to cold caches). See https://github.com/blog/1261-github-availability-this-week for a nice story of automated DB failover going wrong. In our case, we're running keepalived to share the IP address of the postgres master, but we don't automatically act on PG availability changes. In a situation that actually warrants a failover, a human kills the master node by shutting it down; keepalived then selects another master and triggers the failover (which is automated using `trigger_file` in `recovery.conf`). This way we have only one additional piece of infrastructure (keepalived) and we can be sure that we don't accidentally make our lives miserable with automated failovers. The cost is, of course, potential additional downtime while somebody checks the situation, does minimal emergency root cause analysis and then shuts down the failed master. In the even rarer case of hardware failure, keepalived would of course fail over automatically, but let's be honest: most failures are caused by application or devops issues, and in these cases it pays off to be diligent instead of panicking. |
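For anyone wondering what the keepalived-to-`trigger_file` handoff described above might look like, here is a minimal sketch, not the commenter's actual setup. It assumes keepalived's generic `notify` hook calls the script with the new state as the third argument, and that the standby's `recovery.conf` has `trigger_file` pointing at the same path. The path and the function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical keepalived notify script that promotes a PostgreSQL standby."""
import sys
from pathlib import Path

# Assumed path; it has to match trigger_file in the standby's recovery.conf.
DEFAULT_TRIGGER = Path("/var/lib/postgresql/failover.trigger")


def promote_standby(trigger_path: Path) -> bool:
    """Create the trigger file. Returns False if it was already there."""
    if trigger_path.exists():
        return False
    trigger_path.touch()
    return True


def handle_transition(state: str, trigger_path: Path) -> bool:
    """Only act when this node has just become the keepalived MASTER."""
    if state != "MASTER":
        return False
    return promote_standby(trigger_path)


if __name__ == "__main__":
    # Assumed argument order: <GROUP|INSTANCE> <name> <state> [priority]
    new_state = sys.argv[3] if len(sys.argv) > 3 else ""
    handle_transition(new_state, DEFAULT_TRIGGER)
```

Note that nothing here checks whether postgres on the old master is healthy. That is the point of the approach: the decision to fail over stays with the human who shuts the old master down, and the script only reacts to the VIP moving.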
Better to do an assessment of each thing that can fail, how to isolate/detect it, how to recover from it, and how to implement that with available tools, then implement it. Test it in a number of situations on the same hardware, network, and apps you'll use in production. Once it's solid, put it into production. Then never worry about that stuff again beyond monitoring and maintenance.
Btw, Netflix employs Monkeys to do this. Open-sources their tools with blog writeups on their use, too. I'm sure you Humans will be able to handle it. ;)