Hacker News
by kixiQu 101 days ago
> If the payment service went down because a config value was wrong, the incident report should say: the payment service went down because config value X was set to Y when it needed to be set to Z.

The number of junior engineers I have had to coach out of this way of thinking to get the smallest fragment of value out of a postmortem process... dear Lord. I wonder if this person is similarly new to professional collaboration.

The larger personal site is very aesthetically cool, though – make sure you click around if you haven't!

2 comments

Yeah, I wonder if the author has been in a situation where a brief explanation was taken by a higher-up (or a cc'd higher-up x2, or x3) as "It was entirely my fault and I'm withholding details that would further implicate me and giving only the facts that don't."

I've had to work to balance emails like this between "they don't want the nitty gritty, they just want to be satisfied the issue is solved" and "they will definitely want the nitty gritty and think something is up if the details seem suspiciously sparse". Especially if the recipients are technical, and they know that you know that they're technical: "What are you hiding, Qaadika? You're usually more verbose than this."

What's the mistake here? Shouldn't an incident report start with this and then continue with an analysis of the process, without too much "internal perspective"?

In my mind, the internal perspective might be useful to jot down when doing the analysis, but is too noisy to be useful to disseminate.

So I know it's a little bananas to answer this with a link to material the length of a novel, but my feeling is that the real spirit of a postmortem is best carried across by:

https://www.hillelwayne.com/post/stamping-on-eventstream/

He goes through the process, which he describes:

> The constant zooming-out is key here: it’s not enough to find out why things broke, but find out why “why things broke”. In theory you’re supposed to keep doing it: if someone skips a step because of managerial pressure, you ask why the manager was pressuring them in the first place. If the manager was worried about production quotas, find out how the quotas were decided. You just keep going and going and going.

There are different procedures folks can use to capture bits of this to different degrees, but I think this write-up illustrates well both how exhausting it is to do this right and what the value can be. Even if your goal is to get to Action Items, this kind of understanding of your event is what should generate them.

If a person doesn't understand the value, I would imagine they would write something very close to TFA's:

> when something goes wrong [...] they explain why they made the decision, and then explain the contextual factors that influenced that, and then explain why those contextual factors existed, and then explain why it would have been unreasonable to expect them to anticipate the downstream effect of those factors, and by the end you have some fat five paragraphs that contains maybe one sentence worth of information and reads like a legal defense brief written by someone who knows they are guilty.