Hacker News
by roenxi 879 days ago
Someone get Zuck an above-average-manager award.

But that isn't quite what you want in a blameless culture. The right response looks something like ignoring the engineer, gathering the tech leads and having an extremely detailed walkthrough of exactly what went wrong and how they managed to put an engineer in a position where an expensive outage happened, and then having them explain why it is never going to happen again. And if anyone talks about disciplining the responsible engineer, shout at them.

Also maybe check a month later and if anything bad happened to the engineer responsible as a result of the outage, probably sack their manager. That manager is a threat to the corporate culture.

Maybe Zuck did all that too, of course. What do I know. But the story emphasises inaction, and inaction after a crisis is bad.

5 comments

I would say you need to acknowledge and talk to the engineer. They will be stressed and upset; highlighting that there is no blame will ease that.
They'll also be the person most able to identify what went wrong with your processes to allow the failure to occur and think through a mechanism to systematically avoid it happening again.

Also, they're probably the person least likely to make that class of mistake again. If you can keep them, you've added a lot of experiential value to your team.

Perhaps one slight amendment - maybe don't ignore the engineer, but ask them (in a separate, private meeting) if they have any thoughts on the factors that led to it, and any ideas they have on how it could be avoided in future. Could be useful when sanity-checking the tech leads' ideas.

Describing my last company’s incident process exactly.

We’d have like 3 levels of peer review on the breakdown too.

Once there was an incorrect environment variable configured for a big client’s instance, which caused 2 hours of downtime (while we figured out what was wrong), and I had to write a 2-page report on why it happened.

That whole thing got tossed into our incident report black hole.

Personally I feel like the right thing to do is let the engineer closest to the incident lead the response and subsequent action items. If they do well, commend them; if they don't take it seriously, then it may be time to look for a new job.

Instead of a blameless culture, more desirable is a shared responsibility culture.

There are always things that everyone, from the engineer all the way up to the CEO, could have done beforehand and could do afterwards to move the company in a positive direction.

I don’t think “blameless” and “shared responsibility” are mutually exclusive; in fact, they are two sides of the same coin. The dictionary definition of “blameless” does not encompass the practical application of a “blameless” culture, which can be confusing.

The “blameless” part here means the individual who directly triggered the event is not culpable as long as they acted reasonably and per procedure. The “shared responsibility” part is how the organization views the problem and thus how they approach mitigating for the future.

When I think of “blameless” I think of “without fault”: https://www.wordnik.com/words/blameless

But when I think of “shared responsibility”, I think of everyone as sharing fault.

When something goes wrong, I think someone, somewhere likely could have mitigated it to some degree. Even if you’re following procedures, you could question the procedure if you don’t fully understand the implications. Sure, that’s a high bar, but I think it’s preferable to pointing the finger at the people who wrote the procedures.

On that note, someone or some group being at fault doesn’t necessitate punitive action.

> ... but I think it’s preferable to pointing the finger at the people who wrote the procedures ...

It is better to point the finger at the people who wrote the procedures. Their work resulted in a system failure.

If the person doing the work is expected to second-guess the procedures, then there was little point having procedures in the first place, and management loses all control of the situation because they can't expect people to follow procedures any more.

Sure, the person involved can literally ask questions, but after they ask questions the only option they have is to follow the procedure, so there isn't much they can do to avert problems.