by whoknowswhat11 1651 days ago
I've always wondered why services are not counted as down more often. Is there some sliver of customers who still have access to the management console, for example?

An increase in error rates - no biggie, any large system is going to have errors. But when 80%+ of customer workloads in the region are impacted (across availability zones, for whatever good those do) - that counts as down, doesn't it? Error rates in one AZ - degraded. Multi-AZ failures - down?
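
A minimal sketch of that rule of thumb, assuming a monitor that already knows how many AZs show elevated errors and what share of customer workloads is affected. The names (`Status`, `classify_region`, `DOWN_CUSTOMER_FRACTION`) are made up for illustration and the 80% cutoff is just the figure from the paragraph above, not anything AWS publishes:

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    DOWN = "down"


# Hypothetical cutoff, taken from the "80%+" figure in the post.
DOWN_CUSTOMER_FRACTION = 0.80


def classify_region(impacted_azs: int, impacted_customer_fraction: float) -> Status:
    """Classify a region's status from two illustrative inputs.

    impacted_azs: number of availability zones with elevated error rates.
    impacted_customer_fraction: share of customer workloads affected (0.0 to 1.0).
    """
    # Multi-AZ failures, or most customers impacted, count as down.
    if impacted_azs >= 2 or impacted_customer_fraction >= DOWN_CUSTOMER_FRACTION:
        return Status.DOWN
    # Elevated errors confined to a single AZ count as degraded.
    if impacted_azs == 1:
        return Status.DEGRADED
    return Status.OK
```

With these assumptions, `classify_region(1, 0.1)` comes out as degraded, while `classify_region(3, 0.1)` and `classify_region(1, 0.85)` both come out as down.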

2 comments

SLAs. Officially acknowledging an incident means that they now have to issue the SLA credits.
The outage dashboard is normally only updated if a certain $X percent of hosts / service is down. If the EC2 section were updated every time a rack in a datacenter went down, it would be red 24x7.

It's only updated when a large percentage of customers are impacted, and most of the time this number is less than what the HN echo chamber makes it appear to be.

I mean, sure, there are technical reasons why you would want to buffer issues so they're only visible if something big went down (although one would argue that's exactly what the "degraded" status means).

But if the official records say everything is green, a customer is going to have to push a lot harder to get the credits. There is a massive incentive to “stay green”.

Yes, there were. I'm from central Europe and we were at least able to get some pages of the console in us-east-1, but I assume this was more caching-related. Even though the console loaded and worked for listing some entries, we weren't able to post a support case or view SQS messages, etc.

So I agree that degraded is not the proper wording, but it had not completely vanished either. So... hard to tell what a commonly acceptable wording would be here.

From France, when I connect to "my personal health dashboard" in eu-west-3, it says several services are having "issues" in us-east-1.

To your point, for Support Center (which doesn't show a region) it says:

Description

Increased Error Rates

[09:01 AM PST] We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.

[09:26 AM PST] We can confirm increased error rates for the Support Center console and Support API in the US-EAST-1 Region. We have identified the root cause of the issue and are working towards resolution.