Hacker News
by AstroJetson 253 days ago
I did a very long RCA on a problem. My management at the time was really BIG into looking at ALL THE CAUSES. They wanted HUGE fishbone diagrams to show that we had looked at everything. This was in the days of huge drum plotters, so the diagrams could be 36" wide and many feet long.

So I did what they wanted and the root cause was:

On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.

I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith. If he hadn't been born we wouldn't have had this problem, so I just went back to the actual conception date."

I got asked how I was able to pin it to that date and said, "I asked Bob what his father's birthday was and extrapolated from that."

I was never asked to do an RCA again.

1 comment

I model RCAs on my understanding of NTSB investigations, and incident response on my understanding of NASA command centers.

Both are flawed, but they often replace something that works 3x worse than my caricature of either.

The findings should always result in a material change that is worth at least the effort of having done the RCA, not just a checkbox that proves we did something. The investment in the mitigation should honor the consequences of the failure and the uniqueness of the failure. Or rather, the lack of uniqueness. As a failure repeats in kind (e.g., a bunch of 737 Maxes crashing), trust in the system is put in jeopardy. By the time a problem has happened three times, the response should begin to resemble penance.

So how do we get the problem not to hit production again, or how do we at least keep it from happening due to the exact same error?

And for some failure modes, we need to project the consequences going forward. Let's say you find your app is occasionally crashing over a weekend because of memory leaks, combined with no Continuous Deployment activity to forcibly restart the services. We can predict this problem will happen reliably on Memorial Day and Labor Day weekends. So we need to do something relatively serious now.

But it'll also get much worse on Thanksgiving weekend, and just stupid around Christmas, when we have code freezes. So we do something to get us through Memorial Day, but we also need a second story near the top of the backlog that needs to be done by Labor Day, or Thanksgiving at the latest. But we don't necessarily have to do that story next sprint.
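
To make that projection concrete, here's a rough Python sketch. Everything in it is an assumption for illustration: the memory figures, the leak rate, the gap lengths, and the names `RestartGap`, `hours_until_oom`, and `gaps_at_risk` are all made up. It also assumes the leak grows linearly, which real leaks often don't.

```python
from dataclasses import dataclass


@dataclass
class RestartGap:
    """A stretch of time with no deploy to restart the service (hypothetical)."""
    name: str
    hours_without_deploy: float


def hours_until_oom(baseline_mb: float, limit_mb: float, leak_mb_per_hour: float) -> float:
    """Hours a freshly restarted process survives, assuming a linear leak."""
    if leak_mb_per_hour <= 0:
        return float("inf")  # no leak, no deadline
    return max(0.0, (limit_mb - baseline_mb) / leak_mb_per_hour)


def gaps_at_risk(gaps, baseline_mb, limit_mb, leak_mb_per_hour):
    """Names of the gaps that outlast the time-to-OOM budget."""
    budget = hours_until_oom(baseline_mb, limit_mb, leak_mb_per_hour)
    return [g.name for g in gaps if g.hours_without_deploy > budget]


# Illustrative gap lengths: last deploy at 17:00, next deploy at 09:00.
GAPS = [
    RestartGap("regular weekend", 64),         # Fri 17:00 -> Mon 09:00
    RestartGap("three-day weekend", 88),       # Fri 17:00 -> Tue 09:00
    RestartGap("Thanksgiving weekend", 112),   # Wed 17:00 -> Mon 09:00
    RestartGap("Christmas code freeze", 336),  # assumed two-week freeze
]

if __name__ == "__main__":
    # Invented numbers: 500 MB after restart, 2000 MB limit, 20 MB/hour leak.
    print(hours_until_oom(500, 2000, 20))
    print(gaps_at_risk(GAPS, 500, 2000, 20))
```

With these invented numbers the budget is 75 hours, so an ordinary weekend squeaks by and every longer gap gets flagged. That's the shape of the argument: the stopgap has to cover the next long weekend, and the real fix has to land before the longest gap on the calendar.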