Hacker News
by tetha 1610 days ago
After a certain length of outage, you have to start prioritizing differently, though. I only have our own anecdotes to go on, but if someone has been working on a problem for 8-12 consecutive hours under pressure, the quality of their work is going to drop sharply. At that point, it becomes more and more likely that they will make the situation worse instead of fixing it.

And at or beyond that point, you pretty much have to take inspiration from firefighters and emergency services: you need to organize the subsystem experts to rest and sleep in shifts, ideally during simpler but time-consuming tasks. Otherwise these people will crash, and you lose their skills and knowledge for the rest of that outage. And that might render an outage almost impossible to handle.

1 comments

I think I didn't explain myself very well: clearly the on-duty engineers must sleep if it's a multi-day incident, but they also need extra help while they are awake! If the business is completely down, there is no normal work for the other engineers to do, so even if they are outside their usual domain, they might offer good insights or novel ideas, or fix side issues that help the people with more domain knowledge.

The problem is that you don't know how long the outage will be when it starts. I once saw a large outage start, and everyone jumped on to troubleshoot, thinking it would take an hour. Eight hours later it was still an outage, and everyone was still on and burned out. Management should have told half the people who jumped on at the start to go away and be ready for a phone call in 8 hours to provide relief.