
	by notreallyserio 1655 days ago
	Is this true or a joke? This sort of policy is how you destroy trust.

4 comments

jedberg 1655 days ago

From what I've heard it's mostly true. It's not only the CEO; a few SVPs can approve it too. But yes, a human must approve the update, and it must be a high-level exec.

Part of the reason is that their SLAs are based on that dashboard, and the dashboard going red has a financial cost to AWS, so like any financial cost, it needs approval.

dekhn 1655 days ago

Sure, but... that just raises more questions :)

Taken literally, what you are saying is that the service could be down and an executive could override that, preventing them from paying customers for a service outage, even if the service did have an outage and the customer could prove it (screenshots, metrics from other cloud providers, many different folks seeing it).

I'm sure there is some subtlety to this, but it does mean that large corps with influence should be talking to AWS to ensure that status information corresponds with actual service outages.

meetups323 1655 days ago

Large corps with influence get what they want regardless. Status page goes red and the small corps start thinking they can get what they want too.

scrose 1655 days ago

> Status page goes red and the small corps start thinking they can get what they want too.

I think you mean "start thinking they can get what they pay for"

emodendroket 1655 days ago

I have no inside knowledge or anything but it seems like there are a lot of scenarios with degraded performance where people could argue about whether it really constitutes an outage.

dilyevsky 1655 days ago

One time GCP argued that since they did return 404s on GCS for a few hours, that wasn’t an uptime/latency SLA violation, so we were not entitled to a refund (tho they refunded us anyway).

Enginerrrd 1655 days ago

Man, between costs and shenanigans like this, why don't more companies self-host?

dilyevsky 1655 days ago

1. Leadership prefers to blame cloud when things break rather than take responsibility.

2. Cost is not an issue (until it is but you’re already locked in so oh well)

3. FAANG has drained the talent pool of people who know how.

emodendroket 1654 days ago

If you think that’s bad, you should see the outages when you self-host without a big enough team to really manage it.

pm90 1655 days ago

Opex > Capex. If companies thought about the long term, yes, they might consider it. But unless the cloud providers fuck up really badly, they're ok to take the heat occasionally and tolerate a bit of nonsense.

dekhn 1655 days ago

Yep. I was an SRE who worked at Google and also launched a product on Google Cloud. We had these arguments all the time, and the contract language often provides a way for the provider to weasel out.

jedberg 1655 days ago

Like I said, I never worked there and this is all hearsay, but there is a lot of nuance being missed here, like partial outages.

dekhn 1655 days ago

This is no longer a partial outage. The status page reports elevated API error rates, DynamoDB issues, and EC2 API error rates. My company's monitoring is significantly affected (i.e., our IT folks can't tell us what isn't working), and my AWS training class isn't working either.

If this needed a CEO to eventually get around to pressing a button that said "show users the actual information about a problem", that reflects poorly on Amazon.

dhsigweb 1655 days ago

My friend works at a telemetry company that does monitoring, and they are working on alerting customers to cloud service outages before the cloud providers do, since the providers like to sit on their hands for a while (presumably to try to fix it before anyone notices).
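
A minimal sketch of that idea, assuming you run your own probes against the provider and keep a sliding window of results. The class name, window size, and the yellow/red thresholds are all hypothetical, not anything a real telemetry vendor or AWS uses:

```python
from collections import deque


class OutageDetector:
    """Classify a service from our own probe results (hypothetical thresholds)."""

    def __init__(self, window=20, yellow_at=0.05, red_at=0.50):
        # Keep only the most recent `window` probe results; True means success.
        self.results = deque(maxlen=window)
        self.yellow_at = yellow_at  # assumed error rate that counts as degraded
        self.red_at = red_at        # assumed error rate that counts as an outage

    def record(self, ok):
        """Store the outcome of one probe."""
        self.results.append(bool(ok))

    def error_rate(self):
        """Fraction of failed probes in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def status(self):
        """Return "green", "yellow", or "red" for the current window."""
        rate = self.error_rate()
        if rate >= self.red_at:
            return "red"
        if rate >= self.yellow_at:
            return "yellow"
        return "green"
```

Feeding it the result of each probe with `record(...)` and alerting whenever `status()` leaves "green" gives you a signal that does not wait on the provider's own dashboard.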

orangepurple 1655 days ago

Being dishonest about SLAs seems to bear zero cost in this case?

jedberg 1655 days ago

It's not really dishonest though because there is nuance. Most everything in EC2 is still working it seems, just the console is down. So is it really down? It should probably be yellow but not red.

dekhn 1655 days ago

If you cannot access the control plane to create or destroy resources, it is down (partial availability). The jobs that are running are basically zombies.

dekhn 1655 days ago

I'm right in the middle of an AWS-run training and we literally can't run the exercises because of this.

Let me repeat that: my AWS training that is run by AWS that I pay AWS for isn't working, because AWS is having control plane (or other) issues. This is several hours after the initial incident. We're doing training in us-west-2, but the identity service and other components run in us-east-1.

justrudd 1655 days ago

I’m running EKS in us-west-2. My pods use a role ARN and identity token file to get temporary credentials via STS. STS can’t return credentials right now. So my EKS cluster is “down” in the sense that I can’t bring up new pods. I only noticed because an auto-scaling event failed.
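
For anyone unfamiliar with that flow, here is a rough sketch of what the pod-side credential request looks like. It only builds the parameters for an STS `AssumeRoleWithWebIdentity` call and sends nothing. The function name and the default session name are made up for illustration; the environment variable names are the ones the AWS SDKs conventionally read, as far as I know:

```python
import os


def web_identity_request(env=os.environ):
    """Build the parameters for an STS AssumeRoleWithWebIdentity call.

    Hypothetical helper: it shows what a pod needs before it can ask STS
    for temporary credentials, but it makes no network request itself.
    """
    role_arn = env["AWS_ROLE_ARN"]                    # role the pod wants to assume
    token_file = env["AWS_WEB_IDENTITY_TOKEN_FILE"]   # projected identity token path
    with open(token_file, encoding="utf-8") as fh:
        token = fh.read().strip()
    return {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        # "example-session" is an assumed default, not an SDK value.
        "RoleSessionName": env.get("AWS_ROLE_SESSION_NAME", "example-session"),
        "WebIdentityToken": token,
    }
```

Everything up to this point works locally. It is the call to STS with these parameters that fails during the outage, so a new pod never gets credentials even though the cluster itself is running.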

dekhn 1655 days ago

We ran through the whole 4.5 hour training and the training app didn't work the entire time.

jedberg 1655 days ago

Seems like the API is still working and so is auto scaling. So they aren’t really zombies.

Partial availability isn’t the same as no availability.

electroly 1655 days ago

The API is NOT working -- it may not have been listed on the service health dashboard when you posted that, but it is now. We haven't been able to launch an instance at all, and we are continuously trying. We can't even start existing instances.

w0m 1655 days ago

Depending on the workload being run, users may or may not notice. Should be yellow at a minimum.

jrochkind1 1655 days ago

Heroku is currently having major problems. My stuff is still up, but I can't deploy any new versions. Heroku runs their stuff on AWS. I have heard reports of other companies who run on AWS also having degraded service and outages.

I'd say when other companies who run their infrastructure on AWS are going out, it's hard to argue it's not a real outage.

But AWS status _has_ changed to yellow at this point. Heroku could probably be completely down because of an AWS problem, and AWS status would still not show red. But at least yellow tells us there's a problem. The distinction between yellow and red probably only matters at this point to lawyers arguing about the AWS SLA; the rest of us know yellow means "problems", red will never be seen, and green means "maybe problems anyway".

I believe the entire us-east-1 could be entirely missing, and they'd still only put a yellow not a red on status page. After all, the other regions are all fine, right?

jjoonathan 1655 days ago

"Good at finding excuses" is not the same thing as "honest."

paulryanrogers 1655 days ago

SNS seems to be at least partially down as well

jtheory 1655 days ago

My company relies on DynamoDB, so we're totally down.

edit: partly down; it's sporadically failing

mukundesh 1643 days ago

A little honesty from cloud providers will ensure regulators don't step in.

solatic 1655 days ago

Zero directly-attributable, calculable-at-time-of-decision cost. Of course there's a cost in terms of customers who leave because of the dishonest practice, but, who knows how many people that'll be? Out of the customers who left after the outage, who knows whether they left due to not communicating status promptly and honestly or whether it was for some other reason?

Versus: if a company has X signed SLA contracts that specify Y reimbursement for being out for Z minutes, the cost is easily calculable.
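
To make the "easily calculable" side concrete, here is a back-of-the-envelope sketch. The credit tiers, contract count, and bill size are hypothetical numbers picked for illustration, not AWS's actual SLA terms:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Hypothetical credit tiers: (uptime threshold in percent, fraction of bill credited).
# Ordered from worst uptime to best so the largest applicable credit wins.
TIERS = [(99.0, 0.30), (99.99, 0.10)]


def uptime_percent(outage_minutes):
    """Monthly uptime percentage given total minutes of outage."""
    return 100.0 * (MINUTES_PER_MONTH - outage_minutes) / MINUTES_PER_MONTH


def credit_fraction(uptime):
    """Fraction of the monthly bill owed back at a given uptime percentage."""
    for threshold, fraction in TIERS:
        if uptime < threshold:
            return fraction
    return 0.0


def total_exposure(contracts, monthly_bill, outage_minutes):
    """Total credits owed across all contracts (X contracts, Z minutes out)."""
    return contracts * monthly_bill * credit_fraction(uptime_percent(outage_minutes))
```

With these made-up tiers, a 240-minute outage puts monthly uptime at about 99.44%, which lands in the 10% tier. For 1,000 contracts at $5,000 a month, that is roughly $500,000 in credits, a number someone can put in a spreadsheet the moment the dashboard turns red. The cost of lost trust has no such formula.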

notreallyserio 1655 days ago

I wonder how well known this is. You'd think it would be hard to hire ethical engineers with such a scheme in place and yet they have tens of thousands.

bsedlm 1655 days ago

maybe we gotta consider the publicly facing status pages as something other than a technical tool (e.g. marketing or PR or something like that, dunno)

marcosdumay 1655 days ago

If you trust them at this point, you have not been paying attention, and will probably continue to trust them after this.

jeffrallen 1655 days ago

Well, no big deal, there's not really a lot of trust there to destroy...