by hibikir 2538 days ago
Hi Dave, you probably won't remember me (we only spent about 2 months together at Stripe), but I bet Mr Larson remembers.

The first question is who this is written for: it lacks the detail I would write for the incident review meeting audience, while also lacking a simpler story for non-technical readers. As it stood when I read it, I don't think it targets any audience very well.

I understand that the level of detail of the internal report might be excessive for a public report, but if technical readers are the target, some more details would have helped. For example, the monitoring details that Will described in another thread are a key missing piece that, if anything, would make Stripe look better, as problems like that happen all the time. I bet there are more equally useful details in the internal report that would not reveal sensitive information. In general, the only reason I could follow the document well is that I remember how the Stripe storage system worked last year, and I could handwave a year's worth of changes. Since this part of the Stripe infrastructure is relatively unique, it's difficult to understand from the outside, and the write-up reads as if it doesn't have enough information.

In particular, the remediations say very little that is understandable from the outside: most of the text could apply to pretty much any incident on a storage or queuing subsystem I was ever a part of. More alerts, an extra chart in an ever-growing dashboard, some circuit breakers to deal with the specific failure shape... It's all real, but without details, it says very little.

I understand why you might not want to divulge that level of detail though. If we want fewer details, then the article could cut all kinds of low-information sections, and instead focus more on the response and the things that will be changed in the future. The most interesting bit about this is the quick version rollback, which, in retrospect, might not have been the right call. A more detailed view of the alternatives, and of why the actions that ultimately led to the second incident were taken, would be enlightening, and would humanize the piece.

Thank you for not just providing a public root cause analysis, but also coming here to discuss it on HN.