by chucky_z 3988 days ago
"The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP."

If you need multiple nines of uptime, there should be an escalation process for these kinds of events, which will probably include a 24/7 on-call rotation.

Even if the problem is entirely self-resolving, it should still be looked at by more than one system. It should be noted, observed, documented, and confirmed as truly resolved. That additional system is usually a human, but it doesn't necessarily have to be.