by fletom 3397 days ago

what's truly incredible is that S3 has been offline for h̶a̶l̶f̶ ̶a̶n̶ ̶h̶o̶u̶r̶ two hours now and Amazon still has the audacity to put five shiny green checkmarks next to S3 on their service page.

they just now put up a box at the top saying "We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region."

increased error rates? really?

Amazon, everything is on fire. you are not fooling anyone

edit: in the future, please subscribe to @MyFootballNow for timely AWS service status updates https://pbs.twimg.com/media/C5xdm9_WMAAY7y_.jpg:large

13 comments

idlewords 3397 days ago

@mikecb on Twitter explained it well. "The red icon is stored in S3 US East."

jasoncrawford 3397 days ago

Apparently this is not a joke:

“The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.”

https://twitter.com/awscloud/status/836656664635846656

noir_lord 3397 days ago

I think that might be my favourite tweet this year so far.

mpetrovich 3397 days ago

There are some real gems @Pinboard too: "Green checkmark = no lava in data center. Green checkmark with information icon = data center filling with lava https://status.aws.amazon.com"

paulddraper 3397 days ago

FYI, in seriousness you can see the fabled red status icon here:

https://status.aws.amazon.com/images/status3.gif

It does exist, apparently.

rsynnott 3397 days ago

It showed up during the big DynamoDB outage last year.

busterarm 3397 days ago

Is the manager of that group still working there post-outage?

tlrobinson 3397 days ago

I thought this was a joke, but apparently not too far off:

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."

https://twitter.com/awscloud/status/836656664635846656

scarlac 3397 days ago

While that may be true, that's not the reason you're seeing green. You should have been seeing a broken image or a status page not finishing loading if that was an issue.

eknkc 3397 days ago

https://en.m.wikipedia.org/wiki/Sarcasm

eherot 3397 days ago

Or so we thought...

https://twitter.com/awscloud/status/836656664635846656

mikecb 3397 days ago

The best jokes always have a grain of truth.

koolba 3397 days ago

If this is how they're going to handle an outage of their premier AWS service, it's other cloud providers that will be seeing green.

jacquesm 3397 days ago

Funny, cloudflare is also having trouble, page only showed on the third request.

Perfect storm.

As for other cloud providers seeing green: Or maybe people will come to their senses and will see that monocultures are bad, whether in biology or hosting.

koolba 3397 days ago

> Funny cloudflare is also having trouble, page only showed on the third request.

I bet they're related. The moment I got an alert of the S3 outage I started refreshing a bunch of status pages at a feverish pitch. Multiply that by thousands of others doing the same and boom, you've got the equivalent of a DDoS.

leesalminen 3397 days ago

dang [0] just commented to say that their server is getting hammered because of this thread.

[0] https://news.ycombinator.com/item?id=13756819

maxerickson 3397 days ago

I've been getting intermittent Bad Gateways on HN for the last few days.

Ray ID: 33863460edf54231

general_failure 3397 days ago

That was (obviously) sarcasm :)

tonyedgecombe 3397 days ago

Humour is wasted on the internet.

mikecb 3397 days ago

or was it? dun dun

ckozlowski 3397 days ago

(Disclaimer: I work for AWS.)

The dashboard is not changing color due to the S3 issue. We're updating the banner in place of that.

Edit: Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

http://status.aws.amazon.com/

Ph4nt0m 3397 days ago

For some reason, reading about "believe we understand root cause" made me think of: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

perlgeek 3397 days ago

Maybe you could encourage your colleagues to host the status page outside of AWS?

hayleox 3397 days ago

We'll have to wait for the postmortem, but I bet it was an unintentional dependency on S3 that no one realized had come into place until S3 went down -- especially considering how fast they were able to remove the dependency and fix it.

koolba 3397 days ago

This reminds me of a GitHub outage from them having a build dependency on GitHub. IIRC, they tried to roll back to building a prior version but since the site was offline, the build failed.

ckozlowski 3397 days ago

S3 gets used to store a lot of static content. Can't speak for that team, but I'm sure they'll take that feedback. Happy the banner functionality remained unimpeded.

all_usernames 3396 days ago

Possibly the AWS status page wisely relied on a third party, which relied on a fourth party, which relied on S3.

idlewords 3397 days ago

It took them ~30 minutes.

LunaSea 3397 days ago

It took them 2 hours actually.

johansch 3397 days ago

Maybe with GCP? :)

apapli 3397 days ago

I'm happy to offer up some spare space on my godaddy hosted Linux plan if that helps...

danappelxx 3397 days ago

Could you go more in depth? What does S3 have to do with it?

johansch 3397 days ago

I think the most reasonable guess is that they have some backend system that continuously pushes some status json/xml file to an S3 bucket.

Then there's the frontend, that apparently periodically reads this file from S3 and caches the results.

I guess the comment they added on the top after two hours of being in the dark was likely manually added to the web frontend.

Obviously all of this would be hilariously badly designed if it was made this way. Still...
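To make that guess concrete, here is a minimal Python sketch of the failure mode. Everything in it is hypothetical: `StatusFrontend`, `fetch`, and `max_age` are names made up for illustration, and nothing here is known about how the real dashboard works. It only shows why a frontend that caches the last status it managed to read would keep showing green once the store holding the status file goes away, and how a staleness check could report "unknown" instead.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedStatus:
    value: str         # e.g. "green", "yellow", "red"
    fetched_at: float  # clock reading at the last successful fetch


class StatusFrontend:
    """Hypothetical frontend that polls a status file and caches the result."""

    def __init__(self, fetch: Callable[[], str], max_age: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # assumed to raise OSError if the store is down
        self._max_age = max_age  # seconds before a cached value counts as stale
        self._clock = clock
        self._cache: Optional[CachedStatus] = None

    def refresh(self) -> None:
        """Try to read the status file; keep the old cache if the read fails."""
        try:
            self._cache = CachedStatus(self._fetch(), self._clock())
        except OSError:
            pass  # store unreachable: the previous value silently survives

    def naive_status(self) -> str:
        """Serve whatever was cached last, defaulting to green."""
        return self._cache.value if self._cache else "green"

    def honest_status(self) -> str:
        """Serve the cached value only while it is fresh, else 'unknown'."""
        if self._cache is None:
            return "unknown"
        if self._clock() - self._cache.fetched_at > self._max_age:
            return "unknown"
        return self._cache.value
```

With this sketch, a fetch that starts failing leaves `naive_status()` stuck on the last good value forever, while `honest_status()` flips to "unknown" once `max_age` seconds have passed without a successful read.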

unfunco 3397 days ago

It's where they store the error icon.

danappelxx 3396 days ago

So the "working" icon works, but the "not working" doesn't? I'm not sure that's right.

vocatus_gate 3397 days ago

Tragic comedy gold

stretchwithme 3397 days ago

What we need are status pages that are driven by votes from verified customers, which could also serve to inform the provider about issues.

This would address issues that are only visible from the outside.

laughfactory 3397 days ago

And system monitoring which isn't dependent on itself. Kind of a "duh" kind of thing...

ATsch 3397 days ago

http://outage.report/ does this pretty much, except for the "verified customers" part.

j2kun 3397 days ago

> Amazon, everything is on fire. you are not fooling anyone

Fun story, when I was an intern at Amazon there was actually a warehouse fire. The result was a lot of manual database entry updating as products were determined to be destroyed or still fit for sale.

wildmusings 3397 days ago

I'm curious about what happened to products that were no longer fit for sale, but still fit for use. Do you recall?

robaato 3397 days ago

In the military, a warehouse fire or equivalent suddenly generates a ton of "backdated transfer requests" showing that various stock had been sent to the warehouse just previously!

dTal 3397 days ago

This sounds like rank corruption. Surely such a thing is rare in the military?

wildmusings 3397 days ago

To be fair, there's a plausible explanation for what robaato describes that doesn't involve corruption. Suppose it's standard or common to move things first and then file such "backdated transfer requests". After a fire that destroys everything in a warehouse, there would be a flurry of activity to quickly account for everything that was destroyed, so paperwork that would otherwise have trickled in over a month or two might suddenly be hurriedly filed in a few days.

dragonwriter 3397 days ago

It could just be lackadaisical administration that only gets urgently addressed when there is something perceived as a particular problem.

The military is not exactly known for being great at keeping track of things that aren't nuclear weapons, and sometimes falls short even on those.

hobofan 3397 days ago

There is "Amazon Warehouse Deals", Amazon itself acting as a used products seller on Amazon. This is usually used for returns etc., but I wouldn't be surprised if they also handle something like this.

greglindahl 3397 days ago

There are businesses that specialize in remaindering fire-damaged goods -- mostly stuff that smells like smoke.

They showed up in my town in the early 1980s after one of our local malls had a smoky fire. They sold a bunch of stuff that came from other places, too, including a ton of 15mm miniature soldiers.

evtothedev 3397 days ago

The AWS Status page will lie to you: https://medium.com/@ev.dev.dev/the-aws-status-page-will-lie-...

fletom 3397 days ago

I like how this post says "if you look at the AWS Status Page, this what you see". but you can't see the image. because S3 is down.

brianpgordon 3397 days ago

I thought this was funny so I took a screenshot of the blog post and uploaded it to the company Slack. The upload failed because Slack uses S3.

This is getting crazy.

kevin_b_er 3397 days ago

If this isn't good evidence that amazon downright lies on their status page and that no green checkmark should ever be considered trustworthy, I don't know what is.

twistedpair 3397 days ago

Why is the status board hosted on AWS? Most providers host such pages on a 3rd party, specifically for this reason: correlated failure.

rinze 3397 days ago

> edit: in the future, please subscribe to @MyFootballNow for timely AWS service status updates https://pbs.twimg.com/media/C5xdm9_WMAAY7y_.jpg:large

So this is what centralization looks like.

skywhopper 3397 days ago

I hear that their process for updating the status page involves S3.

fletom 3397 days ago

it would appear that you are correct

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates." https://twitter.com/awscloud/status/836656664635846656

kevin_b_er 3397 days ago

I very seriously thought that it was a joke to say that S3 was needed to show the red icon, but apparently they can't update the dashboard about the status of S3 because of S3.

knaik94 3397 days ago

So... https://twitter.com/awscloud/status/836656664635846656

"The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates."

gtrubetskoy 3397 days ago

the non-green icon is probably hosted on s3 (i'm not trying to be funny)

whafro 3397 days ago

"Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard."

Yep.

_ao789 3397 days ago

I love it, like that fixes the problem! ..now fix the REAL problem

artursapek 3397 days ago

I don't think they intentionally kept the checkmarks there. They probably just didn't update it as quickly as developers made a post on Hacker News (not surprising, they were probably investigating).

hobofan 3397 days ago

After having seen multiple AWS outages/service disruptions, with nothing other than a green checkmark ever showing, I am now very confident that the checkmarks are hardcoded and there is no logic behind them.

user5994461 3397 days ago

It's already been confirmed by amazon employees on HN that the color can only be changed manually by an employee and it needs a high level of approval.

Also, there are incentives based on colors, so the managers really don't want to admit any failure.

bdavisx 3397 days ago

While this is a personal feeling and I don't have any data (metrics) to back it up: I think a large percentage (and probably a majority) of metrics don't end up helping a company once they are created - especially if any salary or bonuses are based on them. They are always so gamed that they become worthless.

This is a great case in point if true.

rab-the-goat 3397 days ago

Your point hits on a true thing. One problem is that companies measure proxies for performance, not performance itself. A great book on the topic (and related topics) is Weapons of Math Destruction. Anyway, a green checkbox is pretty far into proxy-land. It's not very closely related to client retention or profitability, and now we see it's not even related to operational time of the equipment. Yikes. So a proxy like this is not even worth using as a metric; it can only cause false confidence that some information is known, and that leads to bad decisions. Not the least of which is bonusing incompetent managers.

SomeStupidPoint 3397 days ago

Amazon has metrics so they can tell a story, not so they can measure things.

As a cute example, one of their senior people (in a stats-heavy role) couldn't explain how they'd detect if people wanted to be able to automatically order socks and T-shirts on a buying cycle outside of what I call the "scheduling horizon", e.g. every 3-6 months. (Things I need regularly, but sparsely enough that it doesn't stand out to do proactively. For example, I buy socks when they all have holes, not on a reasonable replacement cycle.)

KnoopKnoop 3397 days ago

Yup, probably some incentives due to SLAs for their larger customers.

mschuster91 3397 days ago

> Also, there are incentives based on colors, so the managers really don't want to admit any failure.

A textbook case of "wrong incentives". #1 incentive should be satisfied customers.

TeMPOraL 3397 days ago

You can have a high level of customer satisfaction if you lie to them and arrange so that they don't even notice, and by having a good damage control strategy for when some customers do notice they aren't getting what they were promised.

Such an approach has better ROI than actually making high-quality products or services, which is why so much of what we buy is utter shit. That's especially true on the mass market, when satisfaction of individual customers doesn't impact your company at all, as long as they're not complaining too loud.

oneeyedpigeon 3397 days ago

Are any incentives affected by customer complaints? Because if so, all we need to do is complain and they might actually start using the status system meaningfully. (I'm sure this is wishful thinking, though; I doubt people don't complain in this situation!)

fletom 3397 days ago

when you start investigating an outage, that is exactly when you should change your checkmark to yellow if not red. if you're as big as AWS there should not be any more than a minute or two between when your service goes down and when you actually update your status page to show that.

CaptSpify 3397 days ago

It should actually just be automated.

link