by WaxProlix 1612 days ago
In a really large company, you're talking maybe 100-200 people per org. EC2 alone has a massive footprint, for instance: hundreds of engineers, of whom maybe a dozen are on call for their respective components. If something goes wrong in, let's say, CloudWatch, but EC2 is impacted, that's dozens of people working to weight their services out of the impacted AZ, change cache settings, bounce fleets, etc.

A lot of the time root cause is solved by a smaller number of people. But identifying root cause and mitigating impact during an event -- and then communicating specifics of that impact -- can fall to a much larger group.

If 1-3 people are actively solving the issue, they do so alone, and give periodic updates to the broader group through a manager or other communication liaison.