by opmac 2036 days ago
It is kind of perplexing that AWS dogfoods its own status page. I remember during the massive S3 outage a few years ago that their status page remained green almost the entire time because the red/green/blue status icons were stored in... wait for it... S3.

You'd think they would have learned from that.

2 comments

They did. It came up in the post incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted, you'll see that things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif and https://status.aws.amazon.com/images/status0.gif.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
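For anyone who wants to repeat that check, here is a rough sketch of it in Python. It parses a page's HTML and lists any assets served from a host other than the page's own. The sample markup, the `/css/main.css` path, and `cdn.example.com` are made up for illustration; only the two status icon URLs come from the comment above. `AssetCollector` and `external_assets` are hypothetical names, not anything AWS publishes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetCollector(HTMLParser):
    """Collects URLs referenced by src/href attributes on asset tags."""

    ASSET_TAGS = {"img", "script", "link", "iframe"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ASSET_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Resolve relative paths against the page URL.
                self.assets.append(urljoin(self.base_url, value))


def external_assets(page_url, html):
    """Return asset URLs whose host differs from the page's own host."""
    collector = AssetCollector(page_url)
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    return [u for u in collector.assets if urlparse(u).hostname != page_host]


# Made-up markup for illustration; not the real page source.
SAMPLE_HTML = """
<html><head><link rel="stylesheet" href="/css/main.css"></head>
<body>
<img src="/images/status0.gif">
<img src="https://status.aws.amazon.com/images/status1.gif">
<img src="https://cdn.example.com/images/status2.gif">
</body></html>
"""

if __name__ == "__main__":
    for url in external_assets("https://status.aws.amazon.com/", SAMPLE_HTML):
        print(url)
```

Note that an empty result only shows the assets share a hostname. It says nothing about what infrastructure sits behind that hostname.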

One of their main goals was to ensure that it could never go wrong that way again.

Except they posted this: 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)

OK, so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that an outage of AWS's own infrastructure is currently impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.

That's incredibly annoying, given the mandate the replacement service had.

I'd love to be a fly on the wall at the next Ops meeting when it comes up that, yet again, the status dashboard was built in a way that makes it hard to update during an outage.

Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates...lol!

Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple, partly because "ooh shiny", and you can always find reasons to justify it. It takes strong engineering leadership to argue effectively against complicating things without being a constant pain in the arse to everyone and everything.

The kinds of people who are that good tend to be people who aren't going to want to do something as boring as building and maintaining the infrastructure for hosting status pages.

>Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I've never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.
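To give a flavour of the "tests that prove services are viable" part, here is a toy sketch in Python. It runs one probe per service and maps the outcome to a status colour. Everything in it is an assumption for illustration: the `classify` and `dashboard` names, the colour rules (exception means red, slow means yellow), and the one-second threshold. It is not how AWS's dashboard works.

```python
import time
from typing import Callable, Dict


def classify(probe: Callable[[], None], slow_after_s: float = 1.0) -> str:
    """Run one probe and map the outcome to a status colour.

    Hypothetical rules: an exception means "red", a slow success means
    "yellow", anything else is "green".
    """
    start = time.monotonic()
    try:
        probe()
    except Exception:
        return "red"
    elapsed = time.monotonic() - start
    return "yellow" if elapsed > slow_after_s else "green"


def dashboard(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Evaluate every probe and return a service -> colour mapping."""
    return {name: classify(probe) for name, probe in probes.items()}


def healthy() -> None:
    """Stand-in for a real check that succeeds immediately."""


def broken() -> None:
    """Stand-in for a real check that fails."""
    raise RuntimeError("simulated failure")


if __name__ == "__main__":
    print(dashboard({"service-a": healthy, "service-b": broken}))
```

The hard part is keeping the probes and the page that publishes their results independent of the services being measured, which is the thread's whole complaint.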

> It is kind of perplexing that AWS dogfoods its own status page.

> You'd think they would have learned from that.

They did.

The page has been updated numerous times since the start of this incident.

From the status page:

> This issue has also affected our ability to post updates to the Service Health Dashboard.

Just seems so ridiculous that they have trouble reporting the impaired status of their system due to... the impaired status of that same system.

It was 1.5 hours before the first service was put on yellow.