by yazaddaruvala 1610 days ago
Typically:

Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, on any reasonable team, everyone who chipped in nights gets to take off equivalent days, and sprint tasks are all punted.

Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."

Kinda. It definitely would have been crunch time, but a very, very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyway. Keep in mind these typically only last an hour or three, and at most a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".


> Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it.

Well, also, your business is 100% down, so all the capable engineering eyes should be looking at the issue.

After a certain length of outage, you have to start prioritizing differently though. I only have our own anecdotes there. But if someone has been at a problem for 8 to 12 consecutive hours under pressure, the quality of their work is going to drop sharply. At that point, it becomes more and more likely that they make the situation worse instead of fixing it.

And at or beyond that point, you pretty much have to take inspiration from firefighters and emergency services: you need to organize the experts on subsystems to rest and sleep in shifts, ideally during simpler but time-consuming tasks. Otherwise these people will crash and you lose their skills and knowledge for the rest of that outage. And that might render an outage almost impossible to handle.

I think I didn't explain myself very well: clearly on-duty must sleep if it's a multi-day incident, but they also need extra help when they are awake! If the business is completely down, there isn't normal work to do for other engineers so, even if they are out of their typical domain, they might give good insights, novel ideas or fix some side issues that will help the ones with more domain knowledge.

The problem is that you don’t know how long the outage will be when it starts. I once saw a large outage start, and everyone jumped on to troubleshoot, thinking it would be an hour. Eight hours later it was still an outage, and everyone was still on and burned out. Management should have told half the people who jumped on at the start to go away and be prepared for a phone call in 8 hours to provide relief.