Hacker News
by jmaestrooper 247 days ago
>Is 75 minutes really considered that long of a time?

From my experience in setting up and running support services, not really. It's actually pretty darn quick.

First, the issue is reported to level 1 support, which is a bunch of juniors/drones on call, often offshore (depending on the time of day), who'll run through their scripts and, having determined that it's not in there, escalate to level 2.

Level 2 would be a more experienced developer/support tech who's seen a thing or two and dealt with serious issues. It takes time to get them online, as they're on call but not online at 3am EST: they have to get their cup of joe, turn on the laptop, etc. It would take them a bit to realize that the fecal matter made contact with the rotating blades and escalate to level 3.

Which involves setting up the bridge, waking up the decision makers (in my case it was director and VP level), and finally waking up the guy who either a) wrote all this or b) is one of 5 or 6 people on the planet capable of understanding and troubleshooting the tangled mess.

I do realize that AWS support might be structured quite a bit differently, but still... 75 minutes is pretty good.

Edit: That is not to say that AWS doesn't have a problem with turnover. I'm well aware of their policies and tendency to get rid of people in 2-3 years, partially due to compensation structures where there's a significant bump in compensation, and vesting, once you reach that timeframe.

But in this particular case I don't think support should take much of the blame. The overall architecture on the other hand...

3 comments

Sorry, are you saying you worked at Amazon and this is how they handle major outages? Just snooze and wait for a ticket to make its way up from end user support? No monitoring? No global time zone coverage?

Because if so, this seems like about the most damning thing I could learn from this incident.

No, it's just mindless speculation from someone who clearly hasn't worked a critical service's on-call rotation before. It's not at all what it's actually like. All these services have automatic alarms that will start blaring and firing pagers, and once the scope of impact is determined to be large, escalations start happening extremely quickly, paging anyone even possibly able to diagnose the issue. There are also crisis rotations staffed with high-level ICs and incident managers who will join ASAP and start directing the situation; you don't need to wait for some director or VP.

I worked at AWS (EC2 specifically), and the comment is accurate.

Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.

End user tickets cannot page engineers, but fellow internal teams can. Generally, escalating and paging additional help when one cannot handle the situation is encouraged, and many tenured/senior engineers are very keen to help, even at weird hours.
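
To make the routing described above concrete, here is a minimal sketch of the decision an alerting setup like that might make. Everything in it is an assumption for illustration: the `Alert` fields, the `route` function, the 9-to-5 weekday window, and the rule that an unhealthy dependency suppresses the alarm outright. It is not AWS's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    name: str
    customer_impacting: bool
    unhealthy_dependency: bool  # hypothetical flag: an upstream dependency is already alarming


# Assumed business hours for the owning team, in that team's local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route(alert: Alert, team_local_now: datetime) -> str:
    """Return 'suppress', 'page', 'work_now', or 'defer' for an alert."""
    # The dependency's owners are presumably already being paged,
    # so paging this team as well would only add noise.
    if alert.unhealthy_dependency:
        return "suppress"
    # Customer impact pages the on-call regardless of time of day.
    if alert.customer_impacting:
        return "page"
    # Everything else waits for the team's own business hours.
    in_hours = (
        team_local_now.weekday() < 5
        and BUSINESS_START <= team_local_now.time() < BUSINESS_END
    )
    return "work_now" if in_hours else "defer"
```

With these assumed rules, a non-impacting disk warning at 3am on a weekday is deferred, while the same warning at 10am is picked up right away, and anything customer-impacting pages immediately.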

“There are a variety of mechanisms to prioritize and delay issues until business hours”

What are business hours for a global provider of critical tech services?

Business hours for the team receiving the alarm; many issues can wait to be resolved during your own waking hours if they are not impacting customers.

"This is important enough for someone to work on as soon as their shift starts, but not important enough to page someone out of bed for."

AWS operates completely differently than what you're describing.

Alerts and monitoring will result in automatic pages to engineers. There is no human support before it gets escalated.

If an engineer hasn't taken a look within a few minutes, it escalates to their manager, and so on.
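
As a rough illustration of that kind of timed escalation, the sketch below computes who would have been paged after a given number of minutes without an acknowledgement. The `paged_so_far` function, the example chain, and the 5-minute step are all assumptions; the comment only says "a few minutes" and does not give the real interval or chain.

```python
def paged_so_far(chain, minutes_unacked, step_minutes=5):
    """People paged after `minutes_unacked` minutes with no acknowledgement.

    `chain` is ordered from the on-call engineer upward. One more level is
    paged for every `step_minutes` (an assumed interval) that pass unacked.
    """
    levels = min(len(chain), minutes_unacked // step_minutes + 1)
    return chain[:levels]


# Hypothetical escalation chain, for illustration only.
chain = ["on-call engineer", "manager", "director"]
```

Under these assumptions, `paged_so_far(chain, 0)` is just the on-call engineer, `paged_so_far(chain, 7)` adds the manager, and anything from 10 minutes onward reaches the whole chain.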

Wholly inaccurate. AWS Systems Engineers would have been paged by automated monitoring systems once alert thresholds were breached. No escalation through Support needed.