Hacker News new | ask | show | jobs
by guenthert 1653 days ago
Uh, four minutes to identify the root cause? Damn, those guys are on fire.
4 comments

Identify or to publicly acknowledge? Chances are technical teams knew about this and noticed it fairly quickly, they've been working on the issue for some time. It probably wasn't until they identified the root cause and had a handful of strategies to mitigate with confidence that they chose to publicly acknowledge the issue to save face.

I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy to what or why it's broken. Meanwhile you spent hours, days, weeks addressing the issue and suddenly pull a magic solution out of your hat to look like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing which is very valuable if breaking something had some real risk to you.

This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.
:) I imagine it went like this theoretical Slack conversation:

> Dev1: Pushing code for branch "master" to "AWS API". > <slackbot> Your deploy finished in 4 minutes > Dev2: I can't react the API in east-1 > Dev1: Works from my computer

Outage started at 731 PST from our monitoring. They are on fire, but not in a good way.
It was down as of 7:45am (we posted in our engineering channel), so that's a good 40 minutes of public errors before the root cause was figured out.