by swlp21 1632 days ago
Indeed! That's been my route out of trouble on multiple occasions.

The first-time setup takes some effort to get working, and that first fail-over event is both panic-inducing and wonderful to watch as the replica becomes the primary.

Like every disaster recovery plan, the secret is to test it regularly. Do not wait until it is needed, as that is when you will discover that some small but critical element has been left out and things will not work as expected or as needed. I've been there and done that too, unfortunately more than once - some lessons need to be learned more than once to get through.

1 comments

While replication can be a key part of your disaster recovery plan, I think it's more often useful for operational HA: it lets you perform database server maintenance with only seconds of downtime. Real disasters, where you have to fail over to the secondary in an uncontrolled manner and recovering the primary is not an option, are rather rare events for any single database system.

In an actual disaster scenario, the simplest option is always to try recovering the primary database if at all possible, especially if your replication setup is a simple asynchronous one. If you just fail over to a secondary, anything committed on the primary but not yet received by the secondary is lost, and you will have to deal with that data loss and whatever effects it has on your application (monitor your replication lag!).
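
To make the "monitor your replication lag" point concrete, here is a rough sketch of a lag check in Python. It is deliberately database-agnostic: it assumes you already have some way to read the primary's last commit time and the replica's last replayed commit time, and how you get those depends on your database. The function names and both thresholds are made up for illustration and are not from any particular tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: set them from how much data loss
# your application can actually tolerate on an uncontrolled fail-over.
WARN_AFTER = timedelta(seconds=5)
CRITICAL_AFTER = timedelta(seconds=30)


def replication_lag(primary_commit_time, replica_replay_time):
    """Return how far the replica trails the primary.

    Both arguments are timezone-aware datetimes that the caller has
    fetched from the two servers (the fetching is not shown here).
    """
    lag = primary_commit_time - replica_replay_time
    # Clock skew, or reading the two values at slightly different
    # moments, can make the difference negative; treat that as zero.
    return max(lag, timedelta(0))


def classify_lag(lag):
    """Map a lag value to an alert level using the thresholds above."""
    if lag >= CRITICAL_AFTER:
        return "critical"
    if lag >= WARN_AFTER:
        return "warning"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    lag = replication_lag(now, now - timedelta(seconds=12))
    print(lag.total_seconds(), classify_lag(lag))
```

The lag at the moment of failure is roughly the window of writes you would lose by failing over to the secondary, so alerting on it tells you ahead of time how bad an uncontrolled fail-over would be.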

You can also set up a cluster with synchronous replicas for "true" DR, but that gets much more complicated, and honestly is likely unnecessary for most systems.