by opmac 2034 days ago
K so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that an outage of AWS's own infrastructure is currently impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.


That's incredibly annoying, given the mandate the replacement service had.

I'd be curious to be a fly on the wall during the next Ops meeting when it comes up that yet again the status dashboard got made in a way that makes it hard to update during an outage.

Maybe they should ask a question about resilient status page architecture among the ridiculous coding riddles they give candidates...lol!

Part of the problem is, engineers love shiny things.

Status pages are fundamentally boring things. Who wants to work on them?

It's always tempting to complicate something simple, in part because "ooh shiny", and you can always find reasons to justify it. It takes strong engineering leadership to argue effectively against complicating things without being a constant pain in the arse to everyone and everything.

The kinds of people who are that good tend to be people who aren't going to want to do something as boring as building and maintaining the infrastructure for hosting status pages.

>Part of the problem is, engineers love shiny things. Status pages are fundamentally boring things. Who wants to work on them?

I would work on a status page. It's an interesting problem; creating tests that prove services are viable at a place like AWS would be fun. However, what I don't want to deal with is some director of so-and-so I've never heard of yelling at me at 3 in the morning because my status page accurately reported that his service was down. I suspect that plays more into the problem. The status page is a political implement, not a technical one.

Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.

Updated manually by whom?

I'm afraid you're shifting the complexity to a manual process.

I agree that it doesn't have to be, and perhaps should not be, fully automated. But automating some parts will help avoid wasting time on last-minute arguments.