| HN Mirror


	by logical_proof 1651 days ago
	That is not true. There were hours before they started annotating any kind of service issues. Maybe from when you noticed there was a problem it appeared to be quick, but the board remained green for a large portion of the outage.

2 comments

electroly 1651 days ago

No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.

gkop 1651 days ago

At the large platform company where I work, our policy is if the customer reported the issue before our internal monitoring caught it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, and 5 minutes to execute the update, and it adds up to 30 minutes end to end, with a healthy buffer at each step.

1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.
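
To make the arithmetic concrete, here is a minimal sketch that sums the step budgets quoted above and compares them with the 52 minutes cited from the article. The step names and the helper functions `total_budget` and `overrun` are made up for illustration; they are not anyone's actual tooling.

```python
from datetime import timedelta

# Per-step budgets in minutes, as quoted in the comment above.
# The dictionary keys are illustrative labels, not official names.
BUDGET_MINUTES = {
    "alerting lag": 5,
    "evaluate impact": 10,
    "draft and approve": 10,
    "publish update": 5,
}


def total_budget(steps):
    """Sum the per-step budgets into a single end-to-end budget."""
    return timedelta(minutes=sum(steps.values()))


def overrun(actual, steps):
    """Return how far `actual` exceeds the budget (zero if within it)."""
    return max(actual - total_budget(steps), timedelta(0))


# The 52-minute figure is the one attributed to the article.
observed = timedelta(minutes=52)

print(total_budget(BUDGET_MINUTES))       # 0:30:00
print(overrun(observed, BUDGET_MINUTES))  # 0:22:00
```

Under these numbers the first dashboard update landed 22 minutes past the 30-minute target.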

Aperocky 1651 days ago

> our policy is if the customer reported the issue before our internal monitoring caught it

They discovered it right away; it was the Service Health Dashboard that was not updated. source: link.

gkop 1651 days ago

They don’t explicitly say “right away”, do they? I skimmed twice.

But yes you’re right, there’s no reason to question their monitoring or alerting specifically.

acdha 1651 days ago

We saw the timing as described: the dashboard updates started about an hour after the problem began (which we noticed immediately, since 7:30AM Pacific is in the middle of the day for those of us in Eastern time). I don't know if there was an issue with browser caching or similar, but once the updates started, everyone here had no trouble seeing them, and my RSS feed monitor picked them up around that time as well.
