

by chucky_z 3988 days ago

"The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP."

If you need multiple nines of uptime, there should be an escalation process for these kinds of events, and it will probably include a 24/7 on-call rotation.

Even if the problem is entirely self-resolving, it should still be looked at by more than one system. It should be noted, observed, documented, and confirmed as truly resolved. That second system is usually a human, but it doesn't necessarily have to be.
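
To make that concrete, here is a rough Python sketch of what I mean. Everything in it is made up for illustration: the tier timings, the role names, and the `FailoverEvent` / `who_to_notify` names are assumptions, not any real paging tool's API. The point is only that a self-resolved event still pages people until an observer acknowledges it, and it can't be marked resolved until someone (or something) has looked at it.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical escalation tiers: (minutes since the event, who gets notified).
ESCALATION_TIERS = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


@dataclass
class FailoverEvent:
    description: str
    self_resolved: bool  # the system recovered on its own
    acknowledged_by: Optional[str] = None  # second observer, human or automated
    confirmed_resolved: bool = False
    notes: List[str] = field(default_factory=list)

    def acknowledge(self, observer: str, note: str) -> None:
        """Record that an independent observer has looked at the event."""
        self.acknowledged_by = observer
        self.notes.append(note)

    def confirm_resolved(self) -> None:
        """Only an acknowledged event may be marked as truly resolved."""
        if self.acknowledged_by is None:
            raise ValueError("an observer must acknowledge the event first")
        self.confirmed_resolved = True


def who_to_notify(event: FailoverEvent, minutes_elapsed: int) -> List[str]:
    """Return everyone who should have been paged by now.

    Self-resolution alone does not stop escalation; only an
    acknowledgement does.
    """
    if event.acknowledged_by is not None:
        return []
    return [who for after, who in ESCALATION_TIERS if minutes_elapsed >= after]
```

Note that `self_resolved` is deliberately ignored by `who_to_notify`: the failover having worked is not the same thing as someone confirming that it worked.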