Hacker News
by tacLog 1610 days ago
> Big props to the on-calls during this.

Kind of curious about this. I know this is probably company specific but how do outages get handled at large orgs? Would the on-calls have been called in first, and then have called in the rest of the relevant team?

Is there a leadership structure that takes command of the incident to make big coordinated decisions to manage the risk of different approaches?

Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed?

8 comments

Typically:

Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, at any reasonable team, everyone who chipped in nights gets to take off equivalent days, and sprint tasks are all punted.

Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."

Kinda. Definitely would have represented crunch time, but a very very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyways. Keep in mind these typically only last an hour or 3, at most they last a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".

> Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it.

Well, also your business is 100% down, all the capable engineering eyes should be looking at the issue.

After a certain length of outage, you have to start prioritizing differently though. I only have our own anecdotes there. But if someone was at a problem for 8 - 12 consecutive hours under pressure, the quality of their work is going to drop sharply. At such a point, it becomes more and more likely for them to make the situation worse instead of fixing it.

And at or beyond that point, you pretty much have to take inspiration from firefighters and emergency services: You need to organize the experts on subsystems to rest and sleep in shifts, ideally during simpler but time-consuming tasks. Otherwise these people will crash and you lose their skills and knowledge during that outage for good. And that might render an outage almost impossible to handle.
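
The shift idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `plan_shifts`, the 8-hour shift length, and the responder names are all hypothetical, and a real rotation would also weigh subsystem expertise and handoff overlap.

```python
from itertools import cycle


def plan_shifts(responders, groups=2, shift_hours=8, total_hours=24):
    """Split responders into groups and alternate which group is working.

    Hypothetical helper for illustration only; not taken from any real
    incident-management tool.
    """
    if groups < 1 or shift_hours <= 0:
        raise ValueError("groups must be >= 1 and shift_hours must be > 0")
    # Round-robin assignment so each group gets a mix of people.
    teams = [responders[i::groups] for i in range(groups)]
    rotation = cycle(range(groups))
    schedule = []
    for start in range(0, total_hours, shift_hours):
        active = next(rotation)
        schedule.append({
            "start_hour": start,
            "end_hour": min(start + shift_hours, total_hours),
            "working": teams[active],
            # Everyone not in the active group is told to rest.
            "resting": [p for i, t in enumerate(teams) if i != active for p in t],
        })
    return schedule


if __name__ == "__main__":
    for shift in plan_shifts(["ana", "bo", "cy", "di"]):
        print(shift)
```

The point of deciding this at the start is that half the group is explicitly sent away to rest, instead of everyone burning out in the same first shift.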

I think I didn't explain myself very well: clearly on-duty must sleep if it's a multi-day incident, but they also need extra help when they are awake! If the business is completely down, there isn't normal work to do for other engineers so, even if they are out of their typical domain, they might give good insights, novel ideas or fix some side issues that will help the ones with more domain knowledge.
The problem is that you don’t know how long the outage will be when it starts. I once saw a large outage start, everyone jumped on to troubleshoot, thinking it would be an hour. 8 hours later it’s still an outage, and everyone is still on and burned out. Management should have told half the people who jumped on at the start to go away and be prepared for a phone call in 8 hours to provide relief.
Google has its Site Reliability Engineering book, which might answer some of your questions

https://sre.google/sre-book/table-of-contents/

It is an interesting read. Here's the pdf:

https://github.com/captn3m0/google-sre-ebook/releases/downlo...

Is this the same as the O'Reilly dead tree book of the same name?
Yes.
Oncalls get paged first and then escalate. As they assess impact to other teams and orgs, they usually post their tickets to a shared space. Once multiple team/org impact is determined, leadership and relevant ops groups (networking, e.g.) get pulled in to a call. A single ticket gets designated the Master Ticket for the Event, and oncalls dump diagnostic info there. Root cause is found (hopefully); affected teams work to mitigate while the RC team rushes to fix.

The largest of these calls I've seen was well into the hundreds of sw engineers, managers, network engineers, etc.
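
As a rough sketch of the flow described above (tickets posted to a shared space, then one ticket designated the master ticket once several teams are affected), here is a small Python model. The class names, the two-team threshold, and the rule that the earliest ticket becomes the master are assumptions made for illustration, not any company's actual process.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    team: str
    summary: str
    notes: List[str] = field(default_factory=list)


@dataclass
class Event:
    tickets: List[Ticket] = field(default_factory=list)
    master: Optional[Ticket] = None

    def file(self, ticket: Ticket) -> None:
        """An on-call posts a ticket to the shared space."""
        self.tickets.append(ticket)
        teams = {t.team for t in self.tickets}
        # Assumption: impact on two or more teams triggers escalation.
        if self.master is None and len(teams) >= 2:
            # Assumption: the earliest ticket becomes the master ticket.
            self.master = self.tickets[0]

    def add_diagnostic(self, team: str, note: str) -> None:
        """On-calls dump diagnostic info into the master ticket."""
        if self.master is None:
            raise RuntimeError("no master ticket designated yet")
        self.master.notes.append(f"[{team}] {note}")


if __name__ == "__main__":
    event = Event()
    event.file(Ticket("networking", "packet loss"))
    event.file(Ticket("storage", "timeouts"))
    event.add_diagnostic("storage", "latency spike at edge")
    print(event.master)
```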

Wow, that makes complete sense for something that is impacting this many people and by extension lots of money.

Thanks for the answer, I have only ever worked with such a small team that we are all on a call every day.

I can imagine it can probably get a little hectic in large group calls? On the engineering side is there a command structure? Like say the root cause was found and RC team is rushing to fix it. But another team wants to mitigate in the meantime in a slightly risky way. Would their manager make a case with leadership? Would the proposed plan just be put out for general comment as a response to that main ticket?

It depends. I’ve managed major incidents with hundreds of participants.

Our major incident process generally had a “suit” call with non-technical executives and people who would be coordinating customer triage, outreach, etc. Then we would have a tech bridge where the key stakeholders did their thing.

We used the Federal incident command system as a model. It’s a great reference point to use as an inspiration.

Any guides on the "Federal incident command system" to read (i.e. without blindly googling for it)? Thanks.
In addition, you can look into ITIL/ITSM Incident Management plans, they have well developed process structure to work from as a guideline.

I have also seen organizations recommend Kepner-Tregoe method training for real-time, high-pressure problem solving based off NASA Mission Control systems.

https://training.fema.gov/nims/ is a great entry point.
Each company is different. From my experience, it would depend on the severity of the fix and the severity of the issue. The problem would get resolved by any means, i.e. a temporary sticky plaster if necessary.

Another team would then assess and analyse the root cause from a company wide perspective and then assess the risks, costs and impact and then make any modifications (possibly redoing the temporary fix, and fixing it properly)

Real issue: a call center main telephony system and one of the management servers kept crashing, causing over 1400 call center people to stop working. The temporary fix was to reboot the servers every 4 hours, causing minor pain, but the call staff was up and running.

After a whole stupid week of the engineers not being able to find the root cause, it was escalated extremely high and our team was brought in, and we found the root cause in seconds (literally). The servers were VMs and the engineers hadn't checked the physical ESX server they were hosted on. Another VM on the box caused the server to go unstable (ESX not configured correctly).

A BAU project was set up to audit/report and fix all the ESX servers in the company for other stupid config issues.

The person you're responding to is not exactly wrong. But since the users dropped to 0 pretty quickly it's likely that every team with any monitoring at all got paged. At least that's what would happen at the moderately large company I work for.
I'm giving a much broader example of what a large company might do for high impact events. I have no idea what the insides of Roblox look like specifically.
Not to mention a VP or three. A well-led company is going to have management in the line of fire, so to speak, so an outage of this scale would wake them as well.
So 1-3 people actually figure it out while everyone else gets in the way? There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?
Former Google SRE here, I can share my experience although I've never been involved in a large serious outage (thankfully). I've had my fair share of smaller multi-team outages though.

Usually the way it works is so that we have multiple clearly-identified and properly-handed-off roles. There's an Incident Commander (IC) role, whose job is to basically oversee the whole situation, there's various responders (including a primary one) whose job is to mitigate/fix the problems usually relating to their own teams/platform/infra (networking, security, virtualization clusters, capacity planning, logging, etc. depends on the outage). There's also sometimes a communication person (I forget the role name specifically) whose job is to keep people updated, both internal to the outage (responders, etc) and outsiders (dealing with public-facing comms, either to other internal teams affected by the outage or even external customers).

Depending on the size of the outage, the IC may establish a specific "war room" channel (used to be an IRC chatroom, not sure what they use these days though) where most communication from various interested parties will take place. The advantage of a chatroom is that it lets you maintain communication logs and timestamps (useful for postmortem and timeline purposes), and it helps when handing off to the next oncaller during a shift change (they can read the history of what happened).

> There's no way hundreds of engineers, managers, network engineers etc. can get anything actually done as a group, right?

Most people will not really be doing much but when you need to diagnose a problem, having a lot of brains with various expertise in different domains helps, especially if those people are the ones that have implemented a certain service that might be obscure to the other oncallers. Generally speaking, it wouldn't be unheard of to have 30-40 people in the same irc channel brainstorming and coordinating a cross-team effort to mitigate a problem, but into the hundreds? Not quite sure about that much.

Just my two cents. You can probably get more info by reading the Google SRE book https://sre.google/books/

Yeah, I've read the Google SRE book and the product I work on follows Google's SRE model. Sometimes I wonder though if it's all one big anti-pattern. Maybe more precisely it's a pattern designed to work even if nobody knows what's going on. Things are so vastly (over?) complicated. The original designers are long gone. But you still somehow have to keep things going and address any issues that pop up. In our org that SRE model leads to some very weird things because the SREs know the infrastructure (to some degree) but don't really understand the stuff running over it. But I guess we're delivering the service so that's something.

I think the "real world" doesn't work like that. The way the real world works is that things are decoupled in a way that one system's failure doesn't bring the entire world down. So things can be solved in isolation by people that actually understand the system and/or systems are designed in a way that they are serviceable etc.

When the power fails in my neighbourhood, you don't get 100 engineers on a hotline, one van comes down, troubleshoots the problem, and fixes it. Like 3 technicians.

I know there are some exceptions like some power failures that cascaded or the global supply shortages. But those are design failures IMO. A computer system that goes down for this length of time and nobody can figure out why or recover, that seems like a total failure to me on multiple levels. We're just doing this wrong.

Speaking from personal experience, most outages are contained and mitigated within a specific service before they end up impacting other services too. Cascade effects are rare, you just notice them more often because they affect multiple people and usually external-facing customers too. In reality, most things will (or, rather, *should*) page you well before it becomes a cascade-effect incident that multiple teams will have to take care of.

If your problem is that nobody knows what's going on and that stuff constantly brings down a bunch of different systems, you either need to fine-tune your alerting so the affected system tells you something is wrong *before* it reaches other people (monitor your partial rollouts, canary releases, capacity bursts, etc), or you have a problem with playbooks.

The person that implemented the system doesn't need to be the person that fixes it in case there's a problem. We have playbooks that tell us exactly what to do, where to go, which flags to flip, which machine to bring down/bring up, etc in case of various problems. These should be written by the person that implemented the system and any following SRE who's been in charge of fixing bugs or finding issues as a way for the next SRE oncall to not be lost when navigating that space. Remember that the person oncall is not the one responsible for fixing the issue, they are the person responsible for mitigating the problem until the most appropriate person can fix it (preferably not outside working hours).

Again, there can be exceptions that require multiple engineers to work together on multiple services, but in reality that should not be the norm. Most of the pages I handled as an SRE were "silly" things that were self-contained to our team and our service and our customers never even noticed anything was wrong in the first place.

In a really large company, you're talking maybe ~100-200 people per org. EC2 alone has a massive footprint, for instance. Hundreds of engineers, of whom a dozen are maybe oncall for their respective components. If something goes wrong in, let's say, CloudWatch, but EC2 is impacted, that's dozens of people working to weight their services out of the impacted AZ, change cache settings, bounce fleets, etc.

A lot of the time root cause is solved by a smaller number of people. But identifying root cause and mitigating impact during an event -- and then communicating specifics of that impact -- can fall to a much larger group.

If 1-3 people are actively solving the issue, they do so alone, and give periodic updates to the broader group through a manager or other communication liaison.

3 people to fix the Vital Component That Must Work At All Times.

97 people to check/restart/monitor their team's system, because the Vital Component has never failed before so their graceful recovery code is untested or nonexistent.

For the on call system that I ran until recently, there are about a dozen on call teams responsible for parts of the service. Each team has a primary and backup engineer, generally on a 7x24 shift that lasts a week. Most weeks it's not very busy.

Working with them during an incident is an on call comms lead, who handles outside-of-team comms (protecting the engineers), and an engineering lead (who is a consultant, advisor, and can approve certain actions).

For big incidents, an exec incident manager is involved. They primarily help with getting resources from other teams.

Where I work there is an incident team that handles things like creating a master ticket, starting a call bridge, getting the on-calls into the bridge, keeping track of what teams (and who from those teams) have been brought in, manages the call (keeping chatter down and focused when there are 100 people in a call is important), periodically comments on the master ticket with status and a list of impacted teams, marks down milestone times like when the impact started, when it was detected, mitigated, root cause found, etc. This person is also responsible for stuff like when they hear you want to engage team X, they'll go track down an on-call for you, or summarizing known impact for the outward-facing status pages, etc. They also create the postmortem template and follow up with all involved teams to get them to contribute their detailed impact statement there.

Edit: sometimes when it's a really gnarly problem and there are huge numbers of people on the call, the set of people who are actively trying to come up with mitigations and need to just be able to talk freely at each other will break off into a less noisy call and leave a representative to relay status to the main call.
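
To illustrate the milestone bookkeeping mentioned above (impact started, detected, mitigated, root cause found), here is a minimal Python sketch that turns recorded timestamps into the durations a postmortem usually asks for. The function name, the key names, and the sample times are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from recorded milestone to the reported metric name.
METRICS = {
    "detected": "time_to_detect",
    "mitigated": "time_to_mitigate",
    "root_cause_found": "time_to_root_cause",
}


def summarize(milestones):
    """Return elapsed time from impact start to each recorded milestone."""
    start = milestones["impact_started"]
    return {
        METRICS[name]: when - start
        for name, when in milestones.items()
        if name in METRICS
    }


if __name__ == "__main__":
    t0 = datetime(2021, 1, 1, 12, 0)  # made-up start of impact
    timeline = {
        "impact_started": t0,
        "detected": t0 + timedelta(minutes=5),
        "mitigated": t0 + timedelta(hours=2),
    }
    for metric, elapsed in summarize(timeline).items():
        print(metric, elapsed)
```

Milestones that have not happened yet are simply absent from the input, so the summary can be regenerated each time the master ticket is updated.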

Approaches vary company-to-company, but https://response.pagerduty.com/ is a good resource for understanding how it often looks.
At Google an oncaller typically gets paged, triages the incident and, if it's bad, they page other oncallers and/or team members for help. For more serious incidents, people take on different roles like communications lead, incident commander etc.

During the worst outage I was involved in, basically the entire org, including all of the most senior engineers, worked around the clock for two weeks to fix everything.

The on calls ARE the relevant team lol. You're doing it wrong otherwise
As someone with 8 years of experience in SRE in Google: I wouldn't be so sure about that. Most outages require only rudimentary understanding of the particular service. Pretty much "have you tried turning it off and on?", with the extra step of figuring out which piece of the stack needs the kick. Hence, there are many SRE teams that onboard lots of services with this kind of half-support. The on call only performs generic investigation and repair attempts. If that doesn't help, they escalate to the relevant dev team, who likely will only respond in office hours.

Only the important services get dedicated oncalls. Most important ones will have both 24/7 SRE and dev oncalls.

What processes are there (and how effective are they?) to determine if a non-expert SRE should fix something there-and-then (and potentially making things worse) vs. assigning it to a dev team for a correctly engineered fix, at the cost of delays?