by hinathan 5035 days ago
This feels like a pretty standard pattern for a lot of services — fail, come back up on backup DB, fail again when backup proves to not be capable of handling the surge of load, then eventually come back up on the primary DB once people have gotten bored and stopped hitting 'refresh'.

Is that a function of not prewarming failover DBs, or is there something pathological about the primary-secondary pattern?

3 comments

Maybe we just don't hear about situations when the backup/secondary server succeeds in picking up the slack because it would be transparent to the end user?
Selection bias, good point.
Yes, and the selection bias goes even deeper: The simplest possible failover logic is "if something is wrong with our ability to talk to the database, try the secondary database". But, in that case, you almost never see a broken website running on its primary database. Inevitably, the site has already tried failing over to the secondary before it gives up and yells for help.

On the one hand, this is not a good thing, because if you've got a problem that's unrelated to the database (e.g. too much traffic is choking up your supply of DB connections) and then you do a failover, now you have two problems - or, at least, more moving parts to sort out before the situation is resolved. So it's tempting to design a more clever failover scheme. But, on the other hand, cleverness is itself a risk: Not only might your clever algorithm have an even-more-clever pathological failure mode, but it's harder to understand in an emergency. When your stuff is broken, simplicity is your friend. All else being equal, you don't want your front-line emergency responder to have to understand complex failover logic. There is nobody more frustrated than an ops engineer who can't make the system use the primary database because some stupid bot keeps forcing the use of the secondary, or vice versa. In the heat of battle, they're liable to comment out your clever bot and replace it with a one-line shell script.

Engineering is a difficult balancing act.
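
To make the "simplest possible failover logic" above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `connect`, `connect_with_failover`, `DatabaseUnavailable`, and the host names are hypothetical, not taken from any real site or library.

```python
class DatabaseUnavailable(Exception):
    """Raised when neither database accepts a connection."""


def connect_with_failover(connect, primary, secondary):
    """Try the primary; on any connection error, try the secondary.

    `connect` is a hypothetical callable that takes a host name and either
    returns a connection or raises ConnectionError.
    """
    try:
        return connect(primary)
    except ConnectionError:
        # Any trouble talking to the primary triggers failover, whether or
        # not the primary itself is the real cause of the problem.
        pass
    try:
        return connect(secondary)
    except ConnectionError as exc:
        # Only now does the site give up and yell for help, so every
        # visible outage has already tried the secondary once.
        raise DatabaseUnavailable("primary and secondary both failed") from exc
```

Note that nothing here asks why the primary connection failed. If the cause is shared (bad code, too much traffic), the second `connect` call is likely to fail the same way.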

Failing over to the secondary only helps if the problem is local to the primary. If you pushed bad code or the system simply cannot handle the load, the secondary will fail in the same way.
This is why Chaos Monkeys pay dividends.