by YZF 1612 days ago
Yeah, I've read the Google SRE book and the product I work on follows Google's SRE model. Sometimes I wonder though if it's all one big anti-pattern. Maybe more precisely it's a pattern designed to work even if nobody knows what's going on. Things are so vastly (over?) complicated. The original designers are long gone. But you still somehow have to keep things going and address any issues that pop up. In our org that SRE model leads to some very weird things because the SREs know the infrastructure (to some degree) but don't really understand the stuff running over it. But I guess we're delivering the service so that's something.

I think the "real world" doesn't work like that. The way the real world works is that things are decoupled in a way that one system's failure doesn't bring the entire world down. So things can be solved in isolation by people that actually understand the system and/or systems are designed in a way that they are serviceable etc.

When the power fails in my neighbourhood, you don't get 100 engineers on a hotline. One van comes down, troubleshoots the problem, and fixes it. Like 3 technicians.

I know there are some exceptions like some power failures that cascaded or the global supply shortages. But those are design failures IMO. A computer system that goes down for this length of time and nobody can figure out why or recover, that seems like a total failure to me on multiple levels. We're just doing this wrong.

1 comments

Speaking from personal experience, most outages are contained and mitigated within a specific service before they end up impacting other services too. Cascade effects are rare, you just notice them more often because they affect multiple people and usually external-facing customers too. In reality, most things will (or, rather, *should*) page you well before they become a cascade-effect incident that multiple teams will have to take care of.

If your problem is that nobody knows what's going on and that stuff constantly brings down a bunch of different systems, you either need to fine-tune your alerting so the affected system tells you something is wrong *before* it reaches other people (monitor your partial rollouts, canary releases, capacity bursts, etc.), or you have a problem with playbooks.
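
To make the alerting point concrete, here is a rough sketch of the kind of canary check I mean. It compares the error rate of a partial rollout against the rest of the fleet and pages the owning team when the canary is clearly worse. Everything in it (`RolloutStats`, `should_page`, the 2x ratio, the 100-request minimum, the 0.1% floor) is made up for illustration and not taken from any real monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class RolloutStats:
    """Request counts for one slice of a rollout (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when the slice has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def should_page(canary: RolloutStats, baseline: RolloutStats,
                max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Return True if the canary looks clearly worse than the baseline.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if canary.requests < min_requests:
        return False  # too little traffic to judge either way
    # Floor the baseline so a zero-error baseline cannot hide a bad canary.
    floor = max(baseline.error_rate, 0.001)
    return canary.error_rate > max_ratio * floor
```

With these made-up thresholds, a canary at 5% errors against a 0.5% baseline pages, while a canary at 0.6% does not. The point is only that the check lives with the service that is rolling out, so its own team hears about it first.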

The person that implemented the system doesn't need to be the person that fixes it in case there's a problem. We have playbooks that tell us exactly what to do, where to go, which flags to flip, which machine to bring down/bring up, etc., in case of various problems. These should be written by the person that implemented the system and any following SRE who's been in charge of fixing bugs or finding issues, as a way for the next SRE oncall to not be lost when navigating that space. Remember that the person oncall is not the one responsible for fixing the issue, they are the person responsible for mitigating the problem until the most appropriate person can fix it (preferably not outside working hours).
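
As a sketch only, a playbook can be as simple as a lookup from alert name to mitigation steps plus the person or team who owns the real fix. The alert name, steps, and owner below are invented examples, and `Playbook` and `steps_for` are hypothetical names, not any real tool's API.

```python
from dataclasses import dataclass


@dataclass
class Playbook:
    """One playbook entry (hypothetical structure)."""
    alert: str
    mitigation_steps: list[str]
    escalate_to: str  # who owns the actual fix, not the mitigation


# Invented example content, purely for illustration.
PLAYBOOKS = {
    "frontend-high-error-rate": Playbook(
        alert="frontend-high-error-rate",
        mitigation_steps=[
            "Check whether a rollout is in progress and pause it",
            "Roll back to the last known-good release",
            "Confirm the error rate recovers on the dashboard",
        ],
        escalate_to="frontend service owners",
    ),
}


def steps_for(alert_name: str, playbooks: dict[str, Playbook]) -> list[str]:
    """Return what the oncall should do for an alert: mitigate, then hand off."""
    playbook = playbooks.get(alert_name)
    if playbook is None:
        # A missing playbook is itself something to fix after the incident.
        return ["No playbook found: escalate to the service owner, "
                "then write one afterwards"]
    handoff = (f"Hand the root-cause fix to {playbook.escalate_to} "
               "during working hours")
    return playbook.mitigation_steps + [handoff]
```

The last step is always a handoff, which matches the split between mitigating now and fixing later.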

Again, there can be exceptions that require multiple engineers to work together on multiple services, but in reality that should not be the norm. Most of the pages I handled as an SRE were "silly" things that were self-contained to our team and our service, and our customers never even noticed anything was wrong in the first place.