| HN Mirror


by palcu 1191 days ago

There's not much emotion, as the core team working on the huge outages is more like an "SRE for SRE". They are all people who've been with the company for a long time and who've been in the secondary seat for at least one previous big rodeo. Not to mention that we're all running a checklist that has been exercised multiple times, and there's always somebody on the call who could help if a step fails.

Personally, I wasn't part of the actual mitigation of the overall Paris DC recovery this time, as I was busy with an unfortunate[0] side effect of the outage. Side effects like that generate more anxiety, as being woken up at 6am and told that nobody understands exactly why the system is acting this way is not great. But then again, we're trained for this situation and there are always at least several ways of fixing the issue.

Finally, it's worth repeating that incident management is just a part of the SRE job, and after several years I've understood that it is not the most important one. The best SREs I know are not great when it comes to a huge incident. But their work has avoided the other 99 outages that could have appeared on the front page of Hacker News.

[0]: https://news.ycombinator.com/item?id=35734224

2 comments

Waterluvian 1191 days ago

I appreciate your insight into this. Thanks!


throwbigdata 1191 days ago

Who trains the trainers?


palcu 1190 days ago

Life and experience, if you're looking for a short answer. For example, last year we had an outage in London[0] and the folks who worked on it learnt a lot. Now they have applied those learnings in this incident.

[0]: https://news.ycombinator.com/item?id=32161755
