| I read it and couldn't shake the feeling that it's an opinionated piece. It's very likely that the author and I define Root Cause Analysis (RCA) differently. He seems to see RCA as a top-down managerial tool used to find fault or point fingers, as opposed to a good-will exercise among team members (spanning one or a few teams) used to find the areas that are more "fragile" under the current processes and to evaluate whether they can be strengthened and made more resilient. In this context, RCA becomes very similar to the other concept used in the article, the "post-incident review". RCA imo also helps better define the responsibilities and areas of expertise of different members, and what would be acceptable actions and escalations in similar scenarios.

> In fact the person who wrote the offending line of configuration had recently adopted a new cat, who had been keeping them up late the night before, so perhaps the real ‘root cause’ is the cat?

The RCA would help identify the person as a potential single point of failure and evaluate whether additional safeguards can be introduced to make the process more resilient and reduce the probability of the issue happening (have a pair make the change, have better testing before implementing it, etc.). That said, RCA should also take into account the rarity of the issue; sometimes the conclusion is that no change to the process is required, since a change would introduce more complexity and potential weaknesses in other areas than it would provide benefits.

In addition, I find the following statement a bit too strong:

> let’s differentiate between the ‘root cause’ and Least Effort to Remediate (LER). As long as an incident is ongoing, LER is absolutely the right thing to pursue. When the building is on fire, put the fire out as quickly as possible.

This assumes that all incidents are of the "building on fire" type, and that's not true in my book. If the author assumes that RCA is used only for critical issues, that's not the definition I've used. The course of action for a given incident is an "it depends": on the gravity of the issue and the available timeframe to fix it, on whether the LER is something you have already identified or still need to find, on the level of effort for LER vs. identifying the root cause, and on the potential for the applied remediation to cause other issues... heck, you might even decide that it's better to just let the building burn than to try to put out the fire...

Despite the various areas where I didn't agree with the author's conclusions, I'm thankful for the effort he put into this piece and for making his point of view on this topic public. |