by chucky_z
3988 days ago
"The tricky part is then notifying the hell out of everyone who needs to be notified that something really bad has happened, a failover occurred, everything is OK, but it needs some attention ASAP."

If you need multiple nines of uptime, there should be an escalation process for these kinds of events, which will probably include a 24/7 on-call rotation. Even if the problem is entirely self-resolving, it should still be looked at by more than one system: noted, observed, documented, and confirmed as truly resolved. That system is usually a human, but it doesn't necessarily have to be.