by Twirrim 2036 days ago
They did. It came up in the post-incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted, you'll see that things like the status icons all sit under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif, https://status.aws.amazon.com/images/status0.gif, etc.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.

One of their main goals was to ensure that it could never go wrong that way again.
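
For anyone who wants to repeat the same-domain check described above without reading the page source by eye, here is a rough sketch. It assumes you have already saved the page's HTML yourself; the `ResourceHostCollector` class, the `resource_hosts` helper, and the sample markup are all made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceHostCollector(HTMLParser):
    """Collect the hostname of every src/href attribute found on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Note: href also covers ordinary links, not only loaded assets.
            if name in ("src", "href") and value:
                # Resolve relative paths against the page URL first.
                host = urlparse(urljoin(self.base_url, value)).hostname
                if host:
                    self.hosts.add(host)


def resource_hosts(html, base_url):
    """Return the set of hostnames referenced by the given HTML."""
    collector = ResourceHostCollector(base_url)
    collector.feed(html)
    return collector.hosts


# Hypothetical markup standing in for the saved page source.
sample = """
<html><head><link rel="stylesheet" href="/css/status.css"></head>
<body><img src="/images/status0.gif">
<img src="https://status.aws.amazon.com/images/status1.gif"></body></html>
"""
hosts = resource_hosts(sample, "https://status.aws.amazon.com/")
print(hosts)
```

If the resulting set contains a single hostname, everything the page references lives on that one domain. That says nothing about what infrastructure sits behind the domain, only that no other domain is involved.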


Except they posted this: 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)

OK, so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that an outage of AWS's own infrastructure is currently impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.

That's incredibly annoying, given the mandate the replacement service had.

I'd be curious to be a fly on the wall during the next Ops meeting when it comes up that yet again the status dashboard got made in a way that makes it hard to update during an outage.

Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates... lol!

Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple, in part because "ooh shiny", and you can always find reasons to justify it. It takes some strong engineering leadership to effectively argue against complicating things, and not be just a constant pain in the arse to everyone and everything.

The kinds of people who are that good tend to be people who aren't going to want to do something so boring as build and maintain the infrastructure for hosting status pages.

>Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I've never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.

Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.
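
To make "a page that is updated manually" concrete, here is a minimal sketch of what I have in mind, not a description of how AWS builds theirs. The `render_status_page` function and the sample update are made up for illustration. Someone on the incident team types the updates, and the output is a single HTML file with no scripts, stylesheets, or images, so it can be copied to any dumb static host.

```python
from html import escape


def render_status_page(updates):
    """Render manually written updates as one self-contained HTML document.

    `updates` is a list of (timestamp, service, message) strings typed in by
    whoever is running the incident; nothing here probes any service.
    """
    rows = "\n".join(
        f"<li><b>{escape(ts)}</b> {escape(service)}: {escape(message)}</li>"
        for ts, service, message in updates
    )
    body = rows or "<li>No reported issues.</li>"
    # No scripts, stylesheets, or images: the page has no runtime dependencies.
    return (
        "<!doctype html>\n<html><head><meta charset=\"utf-8\">"
        "<title>Service status</title></head>\n"
        f"<body><h1>Service status</h1>\n<ul>\n{body}\n</ul></body></html>\n"
    )


# Illustrative update only, loosely based on the banner quoted in this thread.
page = render_status_page([
    ("7:30 AM PST", "Kinesis", "Increased error rates in US-EAST-1."),
])
print(page)
```

The point of keeping it this dumb is that publishing an update needs nothing except the ability to write one file somewhere that is still up.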