Yes, and the selection bias goes even deeper: The simplest possible failover logic is "if something is wrong with our ability to talk to the database, try the secondary database". But, in that case, you almost never see a broken website running on its primary database. Inevitably, the site has already tried failing over to the secondary before it gives up and yells for help.
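To make that concrete, here is a minimal sketch of the "simplest possible" failover rule. The names `run_query`, `DatabaseUnavailable`, `primary`, and `secondary` are made up for illustration and do not come from any particular library. Note that an error only escapes to a human after the secondary has also been tried, which is exactly the selection bias described above.

```python
class DatabaseUnavailable(Exception):
    """Hypothetical error raised when a database cannot be reached."""


def run_query(query, primary, secondary):
    """Try the primary; on a connectivity failure, try the secondary.

    `primary` and `secondary` are hypothetical callables that take a query
    and either return a result or raise DatabaseUnavailable.
    Returns a (result, which_database) tuple.
    """
    try:
        return primary(query), "primary"
    except DatabaseUnavailable:
        # Only reached after the primary has failed. If this call raises
        # too, the error that finally reaches a human comes from the
        # secondary, never from the primary.
        return secondary(query), "secondary"
```

The rule cannot tell why the primary call failed, so any cause that surfaces as `DatabaseUnavailable` triggers the switch.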

On the one hand, this is not a good thing, because if you've got a problem that's unrelated to the database (e.g. too much traffic is choking up your supply of DB connections) and then you do a failover, now you have two problems - or, at least, more moving parts to sort out before the situation is resolved. So it's tempting to design a more clever failover scheme. But, on the other hand, cleverness is itself a risk: Not only might your clever algorithm have an even-more-clever pathological failure mode, but it's harder to understand in an emergency. When your stuff is broken, simplicity is your friend. All else being equal, you don't want your front-line emergency responder to have to understand complex failover logic. There is nobody more frustrated than an ops engineer who can't make the system use the primary database because some stupid bot keeps forcing the use of the secondary, or vice versa. In the heat of battle, they're liable to comment out your clever bot and replace it with a one-line shell script.
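As a rough illustration of keeping the logic simple enough to overrule in an emergency, the sketch below lets an operator pin the choice. The `DB_FORCE` variable and the `choose_database` function are assumptions invented for this example, not something from the comment above or from any real tool.

```python
import os


def choose_database(primary_healthy, env=os.environ):
    """Return "primary" or "secondary".

    DB_FORCE is a hypothetical environment variable that an operator can
    set to "primary" or "secondary" to pin the choice by hand.
    """
    forced = env.get("DB_FORCE", "")
    if forced in ("primary", "secondary"):
        # The operator's decision always wins over the automatic rule.
        return forced
    # Otherwise fall back to the simple rule: use the primary if it looks
    # healthy, and the secondary if it does not.
    return "primary" if primary_healthy else "secondary"
```

With `DB_FORCE=primary` set, the function keeps returning `"primary"` even when the health check says otherwise, so the person on call does not have to fight the automation.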

Engineering is a difficult balancing act.