Hacker News
by fletom 3397 days ago
what's truly incredible is that S3 has been offline for h̶a̶l̶f̶ ̶a̶n̶ ̶h̶o̶u̶r̶ two hours now and Amazon still has the audacity to put five shiny green checkmarks next to S3 on their service page.

they just now put up a box at the top saying "We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region."

increased error rates? really?

Amazon, everything is on fire. you are not fooling anyone

edit: in the future, please subscribe to @MyFootballNow for timely AWS service status updates https://pbs.twimg.com/media/C5xdm9_WMAAY7y_.jpg:large

13 comments

@mikecb on Twitter explained it well. "The red icon is stored in S3 US East."
Apparently this is not a joke:

“The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.”

https://twitter.com/awscloud/status/836656664635846656

I think that might be my favourite tweet this year so far.
There are some real gems @Pinboard too: "Green checkmark = no lava in data center. Green checkmark with information icon = data center filling with lava https://status.aws.amazon.com"
FYI, in seriousness you can see the fabled red status icon here:

https://status.aws.amazon.com/images/status3.gif

It does exist, apparently.

It showed up during the big DynamoDB outage last year.
Is the manager of that group still working there post-outage?
I thought this was a joke, but apparently not too far off:

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."

https://twitter.com/awscloud/status/836656664635846656

While that may be true, that's not the reason you're seeing green. You should have been seeing a broken image or a status page not finishing loading if that was an issue.
The best jokes always have a grain of truth.
If this is how they're going to handle an outage of their premier AWS service, it's other cloud providers that will be seeing green.
Funny, cloudflare is also having trouble, page only showed on the third request.

Perfect storm.

As for other cloud providers seeing green: Or maybe people will come to their senses and will see that monocultures are bad, whether in biology or hosting.

> Funny cloudflare is also having trouble, page only showed on the third request.

I bet they're related. The moment I got an alert of the S3 outage I started refreshing a bunch of status pages at a feverish pitch. Multiply that by thousands of others doing the same and boom, you've got the equivalent of a DDoS.

dang [0] just commented to say that their server is getting hammered because of this thread.

[0] https://news.ycombinator.com/item?id=13756819

I've been getting intermittent Bad Gateways on HN for the last few days.

Ray ID: 33863460edf54231

That was (obviously) sarcasm :)
Humour is wasted on the internet.
or was it? dun dun
(Disclaimer: I work for AWS.)

The dashboard is not changing color due to the S3 issue. We're updating the banner in place of that.

Edit: Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

http://status.aws.amazon.com/

For some reason, reading about "believe we understand root cause" made me think of: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Maybe you could encourage your colleagues to host the status page outside of AWS?
We'll have to wait for the postmortem, but I bet it was an unintentional dependency on S3 that no one realized had come into place until S3 went down -- especially considering how fast they were able to remove the dependency and fix it.
This reminds me of a GitHub outage from them having a build dependency on GitHub. IIRC, they tried to roll back to building a prior version but since the site was offline, the build failed.
S3 gets used to store a lot of static content. Can't speak for that team, but I'm sure they'll take that feedback. Happy the banner functionality remained unimpeded.
Possibly AWS status page wisely relied on a third party, which relied on a fourth party, which relied on S3.
It took them ~30 minutes.
It took them 2 hours actually.
Maybe with GCP? :)
I'm happy to offer up some spare space on my godaddy hosted Linux plan if that helps...
Could you go more in depth? What does S3 have to do with it?
I think the most reasonable guess is that they have some backend system that continuously pushes some status json/xml file to an S3 bucket.

Then there's the frontend, that apparently periodically reads this file from S3 and caches the results.

I guess the comment they added on the top after two hours of being in the dark was likely manually added to the web frontend.

Obviously all of this would be hilariously badly designed if it was made this way. Still...
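To make that guess concrete, here is a minimal Python sketch of the failure mode, not anything AWS has confirmed. Every name in it (`CachingFrontend`, `fetch_status`, `max_age_s`) is hypothetical; `fetch_status` stands in for the read of the status file from the bucket. It shows how a frontend that quietly falls back to its cache keeps serving the last green it saw, and how tracking the age of the cached copy would turn that into "unknown" instead.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical names throughout; a guess at the failure mode, not AWS's code.
Status = Dict[str, str]  # service name -> "green" | "yellow" | "red"


class CachingFrontend:
    def __init__(self, fetch_status: Callable[[], Status], max_age_s: float = 60.0):
        self.fetch_status = fetch_status  # assumed to read the status file from the store
        self.max_age_s = max_age_s
        self.cached: Optional[Status] = None
        self.cached_at: float = 0.0

    def refresh(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        try:
            self.cached = self.fetch_status()
            self.cached_at = now
        except OSError:
            # Store unreachable: keep the old copy, but do not bump its timestamp.
            pass

    def render_naive(self) -> Status:
        # Serves whatever was cached last, however old: stale green stays green.
        return dict(self.cached or {})

    def render_with_staleness(self, now: Optional[float] = None) -> Status:
        now = time.time() if now is None else now
        if self.cached is None or now - self.cached_at > self.max_age_s:
            # Data too old to trust: report "unknown" instead of the last value.
            return {name: "unknown" for name in (self.cached or {})}
        return dict(self.cached)
```

With the store down, `render_naive()` keeps returning the last green indefinitely, while `render_with_staleness()` flips to "unknown" once the cached copy is older than `max_age_s`.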

It's where they store the error icon.
So the "working" icon works, but the "not working" doesn't? I'm not sure that's right.
Tragic comedy gold
What we need are status pages that are driven by votes from verified customers, which could also serve to inform the provider about issues.

This would address issues that are only visible from the outside.

And system monitoring which isn't dependent on itself. Kind of a "duh" kind of thing...
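A rough Python sketch of the vote-driven idea, under made-up assumptions: the `VoteBoard` class, the 5% threshold for red, and the notion of a fixed set of verified customers are all illustrative, not taken from any real service.

```python
from typing import Dict, Set


class VoteBoard:
    """Hypothetical customer-vote status board; names and thresholds are illustrative."""

    def __init__(self, verified_customers: Set[str], red_fraction: float = 0.05):
        self.verified = verified_customers
        self.red_fraction = red_fraction
        self.reports: Dict[str, Set[str]] = {}

    def report_problem(self, customer: str, service: str) -> bool:
        # Only verified customers may vote; a set gives one vote per customer.
        if customer not in self.verified:
            return False
        self.reports.setdefault(service, set()).add(customer)
        return True

    def status(self, service: str) -> str:
        if not self.verified:
            return "unknown"
        share = len(self.reports.get(service, set())) / len(self.verified)
        if share >= self.red_fraction:
            return "red"
        return "yellow" if share > 0 else "green"
```

Because the board is fed by customers rather than by the provider's own systems, it can show problems that are only visible from the outside, which is the point being made above.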
http://outage.report/ does this pretty much, except for the "verified customers" part.
> Amazon, everything is on fire. you are not fooling anyone

Fun story, when I was an intern at Amazon there was actually a warehouse fire. The result was a lot of manual database entry updating as products were determined to be destroyed or still fit for sale.

I'm curious about what happened to products that were no longer fit for sale, but still fit for use. Do you recall?
In the military, a warehouse fire or equivalent suddenly generates a ton of "backdated transfer requests" showing that various stock had been sent to the warehouse just previously!
This sounds like rank corruption. Surely such a thing is rare in the military?
To be fair, there's a plausible explanation for what robaato describes that doesn't involve corruption. Suppose it's standard or common to move things first and then file such "backdated transfer requests". After a fire that destroys everything in a warehouse, there would be a flurry of activity to quickly account for everything that was destroyed, so paperwork that would otherwise have trickled in over a month or two might suddenly be hurriedly filed in a few days.
It could just be lackadaisical administration that only gets urgently addressed when there is something perceived as a particular problem.

The military is not exactly known for being great at keeping track of things that aren't nuclear weapons, and sometimes falls short even on those.

There is "Amazon Warehouse Deals", Amazon itself acting as a used products seller on Amazon. This is usually used for returns etc., but I wouldn't be surprised if they also handle something like this.
There are businesses that specialize in remaindering fire-damaged goods -- mostly stuff that smells like smoke.

They showed up in my town in the early 1980s after one of our local malls had a smoky fire. They sold a bunch of stuff that came from other places, too, including a ton of 15mm miniature soldiers.

I like how this post says "if you look at the AWS Status Page, this what you see". but you can't see the image. because S3 is down.
I thought this was funny so I took a screenshot of the blog post and uploaded it to the company Slack. The upload failed because Slack uses S3.

This is getting crazy.

If this isn't good evidence that amazon downright lies on their status page and that no green checkmark should ever be considered trustworthy, I don't know what is.
Why is the status board hosted on AWS? Most providers host such pages on a 3rd party, specifically for this reason; correlated failure.
> edit: in the future, please subscribe to @MyFootballNow for timely AWS service status updates https://pbs.twimg.com/media/C5xdm9_WMAAY7y_.jpg:large

So this is what centralization looks like.

I hear that their process for updating the status page involves S3.
it would appear that you are correct

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates." https://twitter.com/awscloud/status/836656664635846656

I very seriously thought that it was a joke to say that S3 was needed to show the red icon, but apparently they can't update the dashboard about the status of S3 because of S3.
So... https://twitter.com/awscloud/status/836656664635846656

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."

the non-green icon is probably hosted on s3 (i'm not trying to be funny)
"Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard."

Yep.

I love it, like that fixes the problem! ..now fix the REAL problem
I don't think they intentionally kept the checkmarks there. They probably just didn't update it as quickly as developers made a post on Hacker News (not surprising, they were probably investigating).
After having seen multiple AWS outages/service disruptions, with nothing other than a green checkmark ever showing, I am now very confident that the checkmarks are hardcoded and there is no logic behind them.
It's already been confirmed by amazon employees on HN that the color can only be changed manually by an employee and it needs a high level of approval.

Also, there are incentives based on colors, so the managers really don't want to admit any failure.

While this is a personal feeling and I don't have any data (metrics) to back it up: I think a large percentage (and probably a majority) of metrics don't end up helping a company once they are created - especially if any salary or bonuses are based on them. They are always so gamed that they become worthless.

This is a great case in point if true.

Your point hits on a true thing. One problem is that companies measure proxies for performance, not performance itself. A great book on the topic (and related topics) is Weapons of Math Destruction. Anyway, a green checkbox is pretty far into proxy-land. It's not very closely related to client retention or profitability, and now we see it's not even related to operational time of the equipment. Yikes. So a proxy like this is not even worth using as a metric; it can only cause false confidence that some information is known, and that leads to bad decisions. Not the least of which is bonusing incompetent managers.
Amazon has metrics so they can tell a story, not so they can measure things.

As a cute example, one of their senior people (in a stats heavy role) couldn't explain how they'd detect if people wanted to be able to automatically order socks and tshirts on a buying cycle outside of what I call the "scheduling horizon", eg every 3-6mos. (Things I need regularly, but sparsely enough it doesn't stand out to do proactively -- eg, I buy socks when they all have holes, not on a reasonable replacement cycle.)

Yup, probably some incentives due to SLAs for their larger customers.
> Also, there are incentives based on colors, so the managers really don't want to admit any failure.

A textbook case of "wrong incentives". #1 incentive should be satisfied customers.

You can have a high level of customer satisfaction if you lie to them and arrange so that they don't even notice, and by having a good damage control strategy for when some customers do notice they aren't getting what they were promised.

Such an approach has better ROI than actually making high-quality products or services, which is why so much of what we buy is utter shit. That's especially true on the mass market, when satisfaction of individual customers doesn't impact your company at all, as long as they're not complaining too loud.

Are any incentives affected by customer complaints? Because if so, all we need to do is complain and they might actually start using the status system meaningfully. (I'm sure this is wishful thinking, though; I doubt people don't complain in this situation!)
when you start investigating an outage, that is exactly when you should change your checkmark to yellow if not red. if you're as big as AWS there should not be any more than a minute or two between when your service goes down and when you actually update your status page to show that.
It should actually just be automated.
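For what "automated" could look like, a minimal Python sketch that derives the color from recent probe results instead of a manual decision. The function name and both thresholds are assumptions picked for illustration; a real system would presumably probe from outside the infrastructure it is reporting on.

```python
from typing import Iterable


def status_from_probes(results: Iterable[bool],
                       yellow_at: float = 0.01,
                       red_at: float = 0.25) -> str:
    """Map recent probe outcomes (True = request succeeded) to a status color.

    The thresholds are illustrative assumptions, not anyone's real policy.
    """
    outcomes = list(results)
    if not outcomes:
        return "unknown"  # no data is not the same as healthy
    error_rate = outcomes.count(False) / len(outcomes)
    if error_rate >= red_at:
        return "red"
    if error_rate >= yellow_at:
        return "yellow"
    return "green"
```

Run on a sliding window of the last minute or two of probes, something like this would change the icon without anyone needing approval to do it.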