Hacker News
by zby 2532 days ago
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method, with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas, no - it was too shallow for that.

This RCA was likely shallow because it was intended for everyone, including non-technical users, who (at least in my experience) tend to misinterpret or get confused by deep technical or systemic failure analysis.

It would be excellent if Stripe published a truly technical RCA, perhaps for distribution via their tech blog, so that folks like us could get a more complete understanding and what-not-to-do lesson (if the failing systems were based on non-proprietary technologies, that is).

From reading the RCA, the stack involved is presumably the trinity of MySQL + Orchestrator + Vitess. If Stripe can't get it right, there is little chance for the rest of us.