by gingerlime 1654 days ago
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.

I think that's the crux of the matter? AWS seems to now have a reputation for ignoring issues that are easily observable by customers, and by the time any update shows up, it's way too late. Whether VPs make this decision or not is irrelevant. If this becomes a known pattern (and I think it has), then the system is broken.

disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both sides of the fence.

2 comments

> disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both sides of the fence.

I'd like to share my experience here. This outage definitely impacted my company. We make heavy use of autoscaling, we use AWS CodeArtifact for Python packages, and we recently adopted AWS Single Sign-On and EC2 Instance Connect.

So, you can guess what happened:

- No one could access the AWS Console.

- No one could access services authenticated with SAML.

- Very few CI/CD, training or data pipelines ran successfully.

- No one could install Python packages.

- No one could access their development VMs.

As you might imagine, we didn't do a whole lot that day.

With that said, this experience is unlikely to change our cloud strategy very much. In an ideal world, outages wouldn't happen, but the reason we use AWS and the cloud in general is so that, when they do happen, we aren't stuck holding the bag.

As others have said, these giant, complex systems are hard, and AWS resolved it in only a few hours! Far better to sit idle for a day rather than spend a few days scrambling, VP breathing down my neck, discovering that we have no disaster recovery mechanism, and we never practiced this, and hardware lead time is 3-5 weeks, and someone introduced a cyclical bootstrapping process, and and and...

Instead, I just took the morning off, trusted the situation would resolve itself, and it did. Can't complain. =P

I might be more unhappy if we had customer SLAs that were now broken, but if that was a concern, we probably should have invested in multi-region or even multi-cloud already. These things happen.

Saying "S3 is down" can mean anything. Our S3 buckets that served static web content stayed up no problem. The API was down though. But for the purposes of whether my organization cares I'm gonna say it was "up".

> We are currently experiencing some problems related to FOO service and are investigating.

A generic, utterly meaningless message, which is still a hell of a lot more than usually gets approved, and approved far too late.

It is also still better than "all green here, nothing to see" which has people looking at their own code, because they _expect_ that they will be the problem, not AWS.

Most of what they actually said via the manual human-language status updates was "Service X is seeing elevated error rates".

While there are still decisions to be made in how you monitor errors and what sorts of elevated rates merit an alert -- I would bet that AWS has internally-facing systems that can display service health in this way based on automated monitoring of error rates (as well as other things). Because they know it means something.
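To make "service health based on automated monitoring of error rates" concrete, here is a minimal sketch of what such a mapping could look like. This is not how AWS does it; the thresholds, the `Window` type, and the state names are all hypothetical and only illustrate the idea that a status can be derived from request counts without a manual approval step.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per service.
DEGRADED_THRESHOLD = 0.01  # 1% of requests failing
OUTAGE_THRESHOLD = 0.20    # 20% of requests failing


@dataclass
class Window:
    """Request counts for one service over a recent time window."""
    service: str
    total: int
    errors: int


def classify(window: Window) -> str:
    """Map an observed error rate to a coarse status-page state."""
    if window.total == 0:
        # No traffic observed: report "unknown" rather than pretending green.
        return "unknown"
    rate = window.errors / window.total
    if rate >= OUTAGE_THRESHOLD:
        return "outage"
    if rate >= DEGRADED_THRESHOLD:
        return "elevated error rates"
    return "ok"


def status_page(windows: list[Window]) -> dict[str, str]:
    """Build a service -> state mapping for every monitored service."""
    return {w.service: classify(w) for w in windows}
```

For example, `status_page([Window("s3-api", 1000, 50)])` would report "elevated error rates" for that service, which is roughly the wording the manual updates ended up using anyway.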

They apparently choose to make their public-facing service health page only show alerts via a manual process that often results in an update only several hours after lots of customers have noticed problems. This seems like a choice.

What's the point of a status page? To me, the point of it is, when I encounter a problem (perhaps noticed because of my own automated monitoring), one of the first things I want to do is distinguish between a problem that's out of my control on the platform, and a problem that is under my control and I can fix.

A status page that does not support me in doing that is not fulfilling its purpose. The AWS status page fails to help customers do that, by regularly showing all green with no alerts hours after widespread problems occurred.

As mentioned in the article, internal metrics were fubar most of the day.

Who cares if it worked for your use case?

Being unable to store objects in an object store means that it’s broken.