by notreallyserio 1655 days ago
Is this true or a joke? This sort of policy is how you destroy trust.

From what I've heard it's mostly true. Not only the CEO but a few SVPs can approve it, but yes a human must approve the update and it must be a high level exec.

Part of the reason is that their SLAs are based on that dashboard, and that dashboard going red has a financial cost to AWS, so like any financial cost, it needs approval.

Sure, but... that just raises more questions :)

Taken literally, what you are saying is the service could be down and an executive could override that, preventing them from paying customers for a service outage, even if the service did have an outage and the customer could prove it (screenshots, metrics from other cloud providers, many different folks seeing it).

I'm sure there is some subtlety to this, but it does mean that large corps with influence should be talking to AWS to ensure that status information corresponds with actual service outages.

Large corps with influence get what they want regardless. Status page goes red and the small corps start thinking they can get what they want too.
> Status page goes red and the small corps start thinking they can get what they want too.

I think you mean "start thinking they can get what they pay for"

I have no inside knowledge or anything but it seems like there are a lot of scenarios with degraded performance where people could argue about whether it really constitutes an outage.
One time GCP argued that since they did return 404s on GCS for a few hours, that wasn’t an uptime/latency SLA violation, so we were not entitled to a refund (though they refunded us anyway).
Man, between costs and shenanigans like this, why don't more companies self-host?
1. Leadership prefers to blame cloud when things break rather than take responsibility.

2. Cost is not an issue (until it is but you’re already locked in so oh well)

3. FAANG has drained the talent pool of people who know how.

If you think that’s bad you should see the outages when you self host without a big enough team to really manage it.
Opex > Capex. If companies thought about the long term, yes, they might consider it. But unless the cloud providers fuck up really badly, they're ok to take the heat occasionally and tolerate a bit of nonsense.
Yep. I was an SRE who worked at Google and also launched a product on Google Cloud. We had these arguments all the time, and the contract language often provides a way for the provider to weasel out.
Like I said I never worked there and this is all hearsay but there is a lot of nuance here being missed like partial outages.
This is no longer a partial outage. The status page reports elevated API error rates, DynamoDB issues, EC2 API error rates, and my company's monitoring is significantly affected (i.e., our IT folks can't tell us what isn't working) and my AWS training class isn't working either.

If this needed a CEO to eventually get around to pressing a button that said "show users the actual information about a problem", that reflects poorly on Amazon.

My friend works at a telemetry company for monitoring and they are working on alerting customers of cloud service outages before the cloud providers since the providers like to sit on their hands for a while (presumably to try and fix it before anyone notices).
Being dishonest about SLAs seems to bear zero cost in this case?
It's not really dishonest though because there is nuance. Most everything in EC2 is still working it seems, just the console is down. So is it really down? It should probably be yellow but not red.
If you cannot access the control plane to create or destroy resources, it is down (partial availability). The jobs that are running are basically zombies.
I'm right in the middle of an AWS-run training and we literally can't run the exercises because of this.

Let me repeat that: my AWS training that is run by AWS that I pay AWS for isn't working, because AWS is having control plane (or other) issues. This is several hours after the initial incident. We're doing training in us-west-2, but the identity service and other components run in us-east-1.

I’m running EKS in us-west-2. My pods use a role ARN and identity token file to get temporary credentials via STS. STS can’t return credentials right now. So my EKS cluster is “down” in the sense that I can’t bring up new pods. I only noticed because an auto-scaling event failed.
We ran through the whole 4.5 hour training and the training app didn't work the entire time.
Seems like the API is still working and so is auto scaling. So they aren’t really zombies.

Partial availability isn’t the same as no availability.

The API is NOT working -- it may not have been listed on the service health dashboard when you posted that, but it is now. We haven't been able to launch an instance at all, and we are continuously trying. We can't even start existing instances.
Depending on the workload being run, users may or may not notice. Should be yellow at a minimum.
Heroku is currently having major problems. My stuff is still up, but I can't deploy any new versions. Heroku runs their stuff on AWS. I have heard reports of other companies who run on AWS also having degraded service and outages.

I'd say when other companies who run their infrastructure on AWS are going out, it's hard to argue it's not a real outage.

But AWS status _has_ changed to yellow at this point. Heroku could probably be completely down because of an AWS problem, and AWS status would still not show red. But at least yellow tells us there's a problem. The distinction between yellow and red probably only matters at this point to lawyers arguing about the AWS SLA; the rest of us know yellow means "problems", red will never be seen, and green means "maybe problems anyway".

I believe all of us-east-1 could be entirely missing, and they'd still only put a yellow, not a red, on the status page. After all, the other regions are all fine, right?

"Good at finding excuses" is not the same thing as "honest."
SNS seems to be at least partially down as well
My company relies on DynamoDB, so we're totally down.

edit: partly down; it's sporadically failing

A little honesty from cloud providers would ensure regulators don't step in.
Zero directly-attributable, calculable-at-time-of-decision cost. Of course there's a cost in terms of customers who leave because of the dishonest practice, but, who knows how many people that'll be? Out of the customers who left after the outage, who knows whether they left due to not communicating status promptly and honestly or whether it was for some other reason?

Versus: if a company has X SLA contracts signed that point to Y reimbursement for being out for Z minutes, that cost is easily calculable.
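
To make the "easily calculable" side concrete, here's a rough sketch in Python. The contract fields, the per-minute credit, and the cap at the monthly fee are all made-up assumptions for illustration, not terms from any real SLA:

```python
from dataclasses import dataclass


@dataclass
class Contract:
    monthly_fee: float        # hypothetical: what the customer pays per month
    credit_per_minute: float  # hypothetical: refund owed per recognized outage minute


def payout(contracts, outage_minutes):
    """Total reimbursement owed for one outage across all signed contracts.

    Assumes each contract's credit is capped at its monthly fee.
    """
    return sum(
        min(c.monthly_fee, c.credit_per_minute * outage_minutes)
        for c in contracts
    )


contracts = [Contract(1000.0, 2.0), Contract(200.0, 1.0)]
print(payout(contracts, 300))  # 600 for the first contract + 200 (capped) for the second
```

The point being that this number exists the moment the dashboard goes red, while the cost of lost trust never shows up as a line item.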

I wonder how well known this is. You'd think it would be hard to hire ethical engineers with such a scheme in place and yet they have tens of thousands.
maybe we gotta consider the publicly facing status pages as something other than a technical tool (e.g. marketing or PR or something like that, dunno)
If you trust them at this point, you have not been paying attention, and will probably continue to trust them after this.
Well, no big deal, there's not really a lot of trust there to destroy...