by minimaul
964 days ago
> Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.

I liked this. The human element is often underemphasised in these kinds of reports, and trying to fix a major outage while overly tired is only going to add avoidable mistakes. I don’t know how it would work for an org of Cloudflare’s size, but I know we have plans for staff to work and sleep in shifts during a significant outage, to try to avoid that problem as well. The issue there is that you need a way to hand over the current state of the outage to new staff as they wake up or come online.
|
Like Mike Tyson says, everyone has a plan until they get punched in the face.