That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including e.g. moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would necessarily be safe. For example, users who don’t use idempotency keys in our API and simply decided to re-charge all orders in their database during the event might inadvertently double-charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future - thank you for the feedback.
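To make the double-charge risk concrete, here is a minimal sketch of why a blind retry is only safe when it carries an idempotency key. Everything in it is an assumption for illustration: `ToyPaymentServer` and `idempotency_key_for` are hypothetical names, and the in-memory server is a stand-in, not Stripe's actual implementation or client library.

```python
import uuid


class ToyPaymentServer:
    """Hypothetical in-memory stand-in for a payments API (not Stripe's real one)."""

    def __init__(self):
        self.charges = []   # every charge that was actually created
        self._seen = {}     # idempotency key -> charge returned the first time

    def charge(self, order_id, amount_cents, idempotency_key=None):
        # A repeated key returns the original result instead of charging again.
        if idempotency_key is not None and idempotency_key in self._seen:
            return self._seen[idempotency_key]
        charge = {
            "id": f"ch_{len(self.charges) + 1}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(charge)
        if idempotency_key is not None:
            self._seen[idempotency_key] = charge
        return charge


def idempotency_key_for(order_id):
    # Derive a stable key from the order so every retry of that order reuses it.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"order:{order_id}"))


# With a key: the first attempt succeeded server-side, but assume the client
# never saw the response and retries blindly.
keyed = ToyPaymentServer()
keyed.charge("order-42", 1999, idempotency_key=idempotency_key_for("order-42"))
keyed.charge("order-42", 1999, idempotency_key=idempotency_key_for("order-42"))
keyed_charge_count = len(keyed.charges)

# Without a key: the same blind retry creates a second charge.
unkeyed = ToyPaymentServer()
unkeyed.charge("order-42", 1999)
unkeyed.charge("order-42", 1999)
unkeyed_charge_count = len(unkeyed.charges)
```

In this toy model the keyed retry leaves a single charge on record, while the unkeyed retry leaves two, which is the scenario described above for anyone re-charging every order after the event.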
In my mind, a "degradation" would be if some fraction of requests were randomly failing, but they would be likely to eventually succeed if retried. Or if the service itself was essentially accessible, but some non-essential functionality was not working correctly.
On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)
I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.
> Because "A substantial majority of API requests during these windows failed." implying that there was not a complete outage.
The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.
> I don't understand why people demand the usage of incorrect language.
My guess is that it's because not everything was down so it wasn't a total outage. From the post mortem:
> Stripe splits data by kind into different database clusters and by quantity into different shards.
So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
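A rough sketch of that reasoning, assuming a simple hash-based layout: the kinds, shard counts, and the `route` and `request_ok` helpers below are all made up for illustration, since the post mortem doesn't describe how Stripe actually routes requests.

```python
import zlib

# Hypothetical layout: one cluster per kind of data, each split into N shards.
SHARDS_PER_CLUSTER = {"charges": 4, "customers": 2}


def route(kind, record_id):
    """Map a record to a (cluster, shard) pair using a stable hash of its ID."""
    shard = zlib.crc32(record_id.encode()) % SHARDS_PER_CLUSTER[kind]
    return (kind, shard)


def request_ok(touched_records, unhealthy):
    """A request succeeds only if none of the shards it touches is unhealthy."""
    return all(route(kind, rid) not in unhealthy for kind, rid in touched_records)


# Suppose the shard holding charge "ch_1" is the one that is down.
unhealthy = {route("charges", "ch_1")}

reads_customer_only = request_ok([("customers", "cus_1")], unhealthy)
touches_bad_shard = request_ok([("customers", "cus_1"), ("charges", "ch_1")], unhealthy)
```

Under these assumptions a request that only reads customer data is unaffected, while any request that also touches the failed shard fails. If that shard sat in the critical path of most requests, most requests would fail even though only one shard was down.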