From what I've heard, it's mostly true. It's not only the CEO; a few SVPs can approve it too. But yes, a human must approve the update, and it must be a high-level exec.
Part of the reason is that their SLAs are based on that dashboard, and that dashboard going red has a financial cost to AWS, so like any financial cost, it needs approval.
Taken literally, what you are saying is that the service could be down and an executive could override that, preventing them from paying customers for a service outage, even if the service did have an outage and the customer could prove it (screenshots, metrics from other cloud providers, many different folks seeing it).
I'm sure there is some subtlety to this, but it does mean that large corps with influence should be talking to AWS to ensure that status information corresponds with actual service outages.
I have no inside knowledge or anything but it seems like there are a lot of scenarios with degraded performance where people could argue about whether it really constitutes an outage.
One time GCP argued that since they did return 404s on GCS for a few hours, that wasn't an uptime/latency SLA violation, so we were not entitled to a refund (though they refunded us anyway).
Opex > Capex. If companies thought about the long term, yes, they might consider it. But unless the cloud providers fuck up really badly, they're ok to take the heat occasionally and tolerate a bit of nonsense.
Yep. I was an SRE who worked at Google and also launched a product on Google Cloud. We had these arguments all the time, and the contract language often provides a way for the provider to weasel out.
This is no longer a partial outage. The status page reports elevated API error rates, DynamoDB issues, and EC2 API error rates. My company's monitoring is significantly affected (i.e., our IT folks can't tell us what isn't working), and my AWS training class isn't working either.
If this needed a CEO to eventually get around to pressing a button that said "show users the actual information about a problem", that reflects poorly on Amazon.
My friend works at a telemetry company for monitoring, and they are working on alerting customers of cloud service outages before the cloud providers do, since the providers like to sit on their hands for a while (presumably to try and fix it before anyone notices).
It's not really dishonest though because there is nuance. Most everything in EC2 is still working it seems, just the console is down. So is it really down? It should probably be yellow but not red.
If you cannot access the control plane to create or destroy resources, it is down (partial availability). The jobs that are running are basically zombies.
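For what it's worth, the yellow-vs-red argument above can be written down as a tiny rule. This is only a sketch of the reasoning in these comments, not how AWS actually decides dashboard colors; the `Status` enum and `classify` function are made-up names for illustration.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify(control_plane_ok: bool, data_plane_ok: bool) -> Status:
    """Hypothetical rule: map plane health to a dashboard color."""
    if control_plane_ok and data_plane_ok:
        return Status.GREEN
    if data_plane_ok:
        # Existing workloads keep running, but nothing can be
        # created or destroyed: partial availability.
        return Status.YELLOW
    # Running workloads themselves are failing.
    return Status.RED


# The situation described in the thread: console/API down, instances still up.
print(classify(control_plane_ok=False, data_plane_ok=True).value)
```

Under this rule, "console is down but EC2 instances are running" lands on yellow, which is roughly where both commenters end up even though they disagree on whether to call it "down".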
I'm right in the middle of an AWS-run training and we literally can't run the exercises because of this.
Let me repeat that: my AWS training, which is run by AWS and which I pay AWS for, isn't working because AWS is having control plane (or other) issues. This is several hours after the initial incident. We're doing training in us-west-2, but the identity service and other components run in us-east-1.
I’m running EKS in us-west-2. My pods use a role ARN and identity token file to get temporary credentials via STS. STS can’t return credentials right now. So my EKS cluster is “down” in the sense that I can’t bring up new pods. I only noticed because an auto-scaling event failed.
The API is NOT working -- it may not have been listed on the service health dashboard when you posted that, but it is now. We haven't been able to launch an instance at all, and we are continuously trying. We can't even start existing instances.
Heroku is currently having major problems. My stuff is still up, but I can't deploy any new versions. Heroku runs their stuff on AWS. I have heard reports of other companies who run on AWS also having degraded service and outages.
I'd say when other companies who run their infrastructure on AWS are going down, it's hard to argue it's not a real outage.
But AWS status _has_ changed to yellow at this point. Heroku could probably be completely down because of an AWS problem, and AWS status would still not show red. But at least yellow tells us there's a problem. At this point the distinction between yellow and red probably only matters to lawyers arguing about the AWS SLA; the rest of us know yellow means "problems", red will never be seen, and green means "maybe problems anyway".
I believe the entire us-east-1 could be entirely missing, and they'd still only put a yellow, not a red, on the status page. After all, the other regions are all fine, right?
Zero directly-attributable, calculable-at-time-of-decision cost. Of course there's a cost in terms of customers who leave because of the dishonest practice, but, who knows how many people that'll be? Out of the customers who left after the outage, who knows whether they left due to not communicating status promptly and honestly or whether it was for some other reason?
Versus, if a company has X signed SLA contracts that specify Y reimbursement for being out for Z minutes, the cost is easily calculable.
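To make the "easily calculable" point concrete, here is a rough sketch of that arithmetic. Everything specific in it is an assumption for illustration: the credit tiers, the 30-day month, and the `Contract` type are invented and are not any provider's actual SLA terms.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day billing month

# Hypothetical tiers: (minimum uptime %, credit as a fraction of the bill).
# Ordered from best uptime to worst; not taken from any real SLA.
CREDIT_TIERS = [(99.99, 0.0), (99.0, 0.10), (95.0, 0.25), (0.0, 1.0)]


@dataclass
class Contract:
    customer: str
    monthly_bill: float


def uptime_percent(outage_minutes: float) -> float:
    """Monthly uptime percentage given total minutes of outage."""
    return 100.0 * (MINUTES_PER_MONTH - outage_minutes) / MINUTES_PER_MONTH


def credit_fraction(uptime: float) -> float:
    """Return the credit fraction of the first tier the uptime satisfies."""
    for min_uptime, fraction in CREDIT_TIERS:
        if uptime >= min_uptime:
            return fraction
    return 1.0


def total_exposure(contracts: list[Contract], outage_minutes: float) -> float:
    """Sum of credits owed across all contracts for one acknowledged outage."""
    fraction = credit_fraction(uptime_percent(outage_minutes))
    return sum(c.monthly_bill * fraction for c in contracts)


contracts = [Contract("a", 1000.0), Contract("b", 5000.0)]
print(total_exposure(contracts, outage_minutes=300))
```

The point of the comparison stands either way: this number exists the moment the outage is acknowledged, while the cost of customers quietly leaving does not show up in any formula like this.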
I wonder how well known this is. You'd think it would be hard to hire ethical engineers with such a scheme in place and yet they have tens of thousands.