| HN Mirror


by hinkley 253 days ago

I model RCAs after my understanding of the NTSB, and incident response after my understanding of NASA command centers.

They're both flawed but often replace something that works 3x worse than my caricature of both.

The findings should always result in a material change that is worth at least the effort of having done it. Not just a checkbox that proves we did something. The investment in the mitigation should honor the consequences of the failure, and the uniqueness of the failure. Or rather, the lack of uniqueness. As a failure repeats in kind (e.g., a bunch of 737 Maxes crashing), trust in the system is put in jeopardy. By the time a problem has happened three times, the response should begin to resemble penance.

So how do we get the problem not to hit production again, or how do we at least keep it from happening due to the exact same error?

And for some failure modes, we need to project the consequences going forward. Let's say you find your app is occasionally crashing over a weekend because of memory leaks, combined with the lack of weekend deploys, since Continuous Deployment would otherwise be forcibly restarting the services. We can predict this problem will happen reliably on Memorial Day and Labor Day. So we need to do something relatively serious now.

But it'll also get much worse on Thanksgiving weekend, and just stupid around Christmas, when we have code freezes. So we do something to get us through Memorial Day, but we also need a second story near the top of the backlog that needs to be done by Labor Day, or Thanksgiving at the latest. But we don't necessarily have to do that story next sprint.
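
To make the projection concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the memory headroom, the leak rates, and the lengths of the deploy-free stretches are all made up, and a real leak is rarely this linear. The idea is only that if you know roughly how long a freshly restarted service survives, you can list which upcoming gaps without a restart are longer than that.

```python
def hours_until_oom(headroom_mb: float, leak_mb_per_hour: float) -> float:
    """Hours a freshly restarted service survives, assuming a constant leak."""
    if leak_mb_per_hour <= 0:
        return float("inf")  # no leak, no deadline
    return headroom_mb / leak_mb_per_hour


def at_risk(gaps, headroom_mb: float, leak_mb_per_hour: float):
    """Names of the deploy-free gaps that outlast the service's survival time."""
    limit = hours_until_oom(headroom_mb, leak_mb_per_hour)
    return [name for name, hours in gaps if hours > limit]


# Hypothetical stretches with no deploy (and so no restart), in hours.
GAPS = [
    ("normal weekend", 64),          # Fri 17:00 to Mon 09:00
    ("three-day weekend", 88),       # Fri 17:00 to Tue 09:00
    ("Thanksgiving weekend", 112),   # Wed 17:00 to Mon 09:00
    ("Christmas code freeze", 336),  # assumed two-week freeze
]

# Assumed numbers: 1500 MB of headroom, leaking 20 MB per hour (75 hours to live).
print(at_risk(GAPS, headroom_mb=1500, leak_mb_per_hour=20))
# ['three-day weekend', 'Thanksgiving weekend', 'Christmas code freeze']
```

With these made-up numbers an ordinary weekend squeaks by, which fits "occasionally crashing": bump the leak to 25 MB per hour and the normal weekend lands on the list too. The longer gaps are the ones that set the deadlines for the two stories.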