| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Morgawr 1614 days ago

Speaking from personal experience, most outages are contained and mitigated within a specific service before they end up impacting other services too. Cascade effects are rare, you just notice them more often because they affect multiple people and usually external-facing customers too. In reality, most things will (or, rather, *should*) page you well before it becomes a cascade-effect incident that multiple teams will have to take care of.

If your problem is that nobody knows what's going on and that stuff constantly brings down a bunch of different systems, you either need to finetune your alerting so the affected system tells you something is wrong *before* it reaches other people (monitor your partial rollouts, canary releases, capacity bursts, etc), or you have a problem with playbooks.

The person that implemented the system doesn't need to be the person that fixes it in case there's a problem. We have playbooks that tell us exactly what to do, where to go, which flags to flip, which machine to bring down/bring up, etc in case of various problems. These should be written by the person that implemented the system and any following SRE who's been in charge of fixing bugs or finding issues as a way for the next SRE oncall to not be lost when navigating that space. Remember that the person oncall is not the one responsible for fixing the issue, they are the person responsible for mitigating the problem until the most appropriate person can fix it (preferably not outside working hours).

Again, there can be exceptions that require multiple engineers to work together on multiple services, but in reality that should not be the norm. Most of the pages I handled as an SRE were "silly" things that were self-contained to our team and our service and our customers never even noticed anything was wrong in the first place.