


	by jacquesm 2531 days ago
	Why don't they call 'significantly elevated error rates' an 'outage' instead?

3 comments

dps 2531 days ago

(Stripe CTO here)

That's a reasonable question. We wrote this RCA to help our users understand what had happened and to help inform their own response efforts. Because a large absolute number of requests with stateful consequences (including, e.g., moving money IRL) succeeded during the event, we wanted to avoid customers believing that retrying all requests would necessarily be safe. For example, users (if they don’t use idempotency keys in our API) who simply decided to re-charge all orders in their database during the event might inadvertently double charge some of their customers. We hear you on the transparency point, though, and will likely describe events of similar magnitude as an "outage" in the future. Thank you for the feedback.
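
As a rough illustration of the double-charge risk described above, here is a toy sketch. It is not Stripe's API or implementation: `FakePaymentAPI`, `charge_with_lost_response`, and the key format are hypothetical names made up for this example. It simulates a charge that succeeds on the server while the client sees an error and blindly retries, once without an idempotency key and once with one.

```python
class FakePaymentAPI:
    """Toy in-memory stand-in for a payment API (hypothetical, not Stripe's)."""

    def __init__(self):
        self.charges = []   # every charge actually created server-side
        self._seen = {}     # idempotency key -> result of the first request

    def charge(self, order_id, amount_cents, idempotency_key=None):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key is not None and idempotency_key in self._seen:
            return self._seen[idempotency_key]
        result = {
            "id": f"ch_{len(self.charges) + 1}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(result)
        if idempotency_key is not None:
            self._seen[idempotency_key] = result
        return result


def charge_with_lost_response(api, order_id, amount_cents, key=None):
    """Simulate a charge that succeeded, looked like a failure, and was retried."""
    # First attempt: the server creates the charge, but assume the client
    # never sees the response (for example, a timeout during the incident).
    api.charge(order_id, amount_cents, idempotency_key=key)
    # Blind retry of the same order.
    return api.charge(order_id, amount_cents, idempotency_key=key)


naive = FakePaymentAPI()
charge_with_lost_response(naive, "order-1", 5000)

safe = FakePaymentAPI()
# Assumption: the key is derived from the order so every retry reuses it.
charge_with_lost_response(safe, "order-1", 5000, key="charge-order-1")

print(len(naive.charges), len(safe.charges))  # prints "2 1"
```

Without a key the customer is charged twice; with a stable key the retry is a no-op on the server side.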


jacquesm 2531 days ago

And thank you for the answer and for being open to outsider input.


NikolaeVarius 2531 days ago

Because "A substantial majority of API requests during these windows failed. " implying that there was not a complete outage.

I don't understand why people demand the usage of incorrect language.


teraflop 2531 days ago

In my mind, a "degradation" would be if some fraction of requests were randomly failing, but they would be likely to eventually succeed if retried. Or if the service itself was essentially accessible, but some non-essential functionality was not working correctly.

On the other hand, if for a significant number of users the site was completely unusable for some period of time, then I think it's fair to use the word "outage". (Even if it's not a complete outage affecting all users.)

I don't know whether other people would interpret these terms the same way I do, nor do I think there's enough information in this blog post to determine for sure which label is more accurate for this particular incident. So personally, I'm not going to be too picky about the wording.


ComputerGuru 2531 days ago

> Because "A substantial majority of API requests during these windows failed. " implying that there was not a complete outage.

The fact that you needed to qualify “outage” with “complete” clearly means the word on its own is not incorrect for cases where a system was “only” mostly unavailable rather than completely so.

> I don't understand why people demand the usage of incorrect language.

The irony.


dmlittle 2531 days ago

My guess is that it's because not everything was down so it wasn't a total outage. From the post mortem:

> Stripe splits data by kind into different database clusters and by quantity into different shards.

So in theory any request that didn't interact with the problematic database should have been OK (I don't know if the offending DB was in the critical path of _every_ request).
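
To make that reasoning concrete, here is a minimal sketch of routing by kind and shard. Everything in it is assumed for illustration: the `CLUSTERS` layout, the CRC32-based `route` function, and `request_succeeds` are hypothetical and say nothing about how Stripe actually routes data.

```python
import zlib

# Hypothetical layout: one cluster per kind of data, each split into N shards.
CLUSTERS = {"charges": 4, "customers": 2}


def route(kind, record_id):
    """Return the (cluster, shard) pair that would hold this record."""
    shard = zlib.crc32(record_id.encode("utf-8")) % CLUSTERS[kind]
    return kind, shard


def request_succeeds(touched, failed_shards):
    """A request succeeds only if none of the records it touches live on a failed shard."""
    return all(route(kind, rid) not in failed_shards for kind, rid in touched)


# Assume the whole "charges" cluster is unavailable.
failed = {("charges", s) for s in range(CLUSTERS["charges"])}

print(request_succeeds([("customers", "cus_1")], failed))
print(request_succeeds([("customers", "cus_1"), ("charges", "ch_9")], failed))
```

Under these assumptions the first request never touches the failed cluster and goes through, while the second depends on it and fails. Whether real traffic looked like the first or the second case is exactly the open question about the critical path.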
