by dps
2531 days ago
(Stripe CTO here) Thanks for the questions. We have testing procedures and deploy mechanisms that enable us to ship hundreds of deploys a week safely, including many which touch our infrastructure. For example, we do a fleetwide version rollout in stages with a blue/green deploy for typical changes. In this case, we identified a specific code path that we believed had a high potential to cause a follow-up incident soon. The course of action was reviewed by several engineers; however, we lacked an efficient way to fully validate this change on the order of minutes. We're investing in building tooling to increase robustness in rapid response mechanisms and to help responding engineers understand the potential impact of configuration changes or other remediation efforts they're pushing through an accelerated process. I think our engineers’ approach was strong here, but our processes could have been better. Our continuing remediation efforts are focused there.
I hope that lessons are learned from this operational event, and that you invest in building metrics and tooling that allow you to, first of all, prevent issues and, second, shorten outage and mitigation times in the future.
I'm happy you guys are being open about the issue and taking feedback from people outside your company. I definitely applaud this.