by noop_joe
638 days ago
One of the most difficult challenges with incidents is dispelling the initial conjecture. Something bad happens and a lot of theories flood the discussion. Engineers work to prove or disprove those theories, but the story about one of them might take on a life of its own outside the dev team. What ends up happening is that post-incident there's a lot of work to show not only that the problem was the result of XYZ, but also that it definitely wasn't the result of ABC.

I was responsible for wsj.com for a few years. The homepage, articles and section fronts were considered dial-tone services (cannot under any circumstances go down). My job was to lead the transition from the on-prem site to the redesigned cloud site. As you can imagine, there were a few hiccups along that journey.

One particular incident happened when reporters broke the news of a few unrelated industry computer system failures (including one in finance). Because it involved a financial system, people visited wsj, and the spike in traffic was so large it knocked us out. Now other news outlets were reporting wsj down. Unfortunately, there was a perception that these incidents were all related and part of a coordinated hacking event.

For each minute the site had an interruption of service, I would need to spend hours post-incident making sure the causes were understood and verified, and that stakeholders knew what they were.

All in all, the on-call experiences were fine. Sure, people were tired if incidents happened in the middle of the night, but the team was supportive and there was a culture of direct problem solving that didn't add _extra_ stress. Fun stuff.