Hacker News
by blondie9x 1798 days ago
Why the heck are status pages so useless here?
5 comments

In order for status pages to be useful, they need to be hosted externally, with no dependencies on the product/system/company they report on.

Status pages are already fairly hard to automate, and making them external makes it harder still, so they generally aren't automated. And since, hopefully, they're rarely exercised, they tend to be forgotten. The best way to manage this, IMHO, is to let customer service update the status page and delegate the duty to them. Updating the status page generally reduces customer service demand, so CS gets the most direct and timely feedback and is best placed to manage it.

Source: I ran a status page, poorly. It wasn't fully external either, but it was far away from all the moving parts, so it would only have failed along with the product if our hosting had a multi-site routing issue (which happened once, but not while I was there) or if DNS broke. We had mitigations in the clients for DNS not working, so that wasn't too bad.
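
To make the "no shared dependencies" point concrete, here is a minimal sketch that flags anything a status page has in common with the product it reports on. The dependency names are made up for illustration; a real inventory would come from wherever you track hosting, DNS, and storage.

```python
def shared_dependencies(status_page_deps, product_deps):
    """Return the dependencies that the status page shares with the product.

    Anything in the result is a potential correlated failure: if it goes
    down, the product and the page reporting on it can go down together.
    """
    return sorted(set(status_page_deps) & set(product_deps))


# Hypothetical inventories, for illustration only.
product = {"dns-provider-a", "hosting-site-1", "object-storage", "auth-service"}
status_page = {"dns-provider-a", "external-static-host"}

overlap = shared_dependencies(status_page, product)
print(overlap)  # ['dns-provider-a']
```

An empty result is the goal. Here the shared DNS provider is the one thing that could take out both at once, which matches the DNS caveat above.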

Because status pages are now political documents about SLAs, written for executives and lawyers, rather than useful technical tools for developers and sysadmins.
Internal politics. Anecdotally, org leaders face potentially serious consequences if the status page shows the products they’re responsible for as down, so they find ways to cover it up. This culture is passed from the top down, since everyone is incentivized to downplay widespread outages.
Yep, the actual problem is lies about service uptime in marketing content or during contract negotiations. "Well, all of our competitors claim five nines so we have to, too."
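
For a sense of what "five nines" actually promises, here is a quick back-of-the-envelope calculation. It assumes a 365-day year and counts every minute equally; real SLAs usually define their own measurement windows and exclusions, so treat this as arithmetic only.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year, which is less time than it usually takes a human to notice an incident and update a status page.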
One pessimistic argument would be that AWS has no incentive to be honest or timely about acknowledging service outages.
I'd call that the realistic argument. What's the optimistic argument if that's pessimistic?

That an entity the scale of Amazon and AWS just can't figure out how to report system status? Or they forgot? Or didn't have the demand to create it?

> What's the optimistic argument if that's pessimistic?

They have some rudimentary automated checks that don't cover enough failure scenarios. After some time, once there are enough reports on Twitter, they check manually whether something is wrong, and a human operator writes an update on the status page.
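
The "rudimentary automated checks plus a human operator" flow might look something like the sketch below. This is purely illustrative and not how AWS or anyone else actually does it; `StatusBoard`, the probe names, and the status strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StatusBoard:
    # Hypothetical probes: each returns True if its check passed.
    probes: Dict[str, Callable[[], bool]]
    # Set by a human operator; takes precedence over the automated result.
    manual_override: Optional[str] = None

    def status(self) -> str:
        if self.manual_override is not None:
            return self.manual_override
        failed = []
        for name, probe in self.probes.items():
            try:
                ok = probe()
            except Exception:
                # A probe that crashes is treated as a failed check.
                ok = False
            if not ok:
                failed.append(name)
        if not failed:
            return "operational"
        if len(failed) == len(self.probes):
            return "major outage"
        return "degraded"


board = StatusBoard({"api": lambda: True, "uploads": lambda: False})
print(board.status())  # degraded

board.manual_override = "investigating elevated error rates"
print(board.status())  # investigating elevated error rates
```

The weakness is visible in the sketch: the page can only go red for failures some probe happens to cover, and the override lets a human say whatever they like, in either direction.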

> That an entity the scale of Amazon and AWS just can't figure out how to report system status?

Yeah, that's literally what they've claimed in the past. When the 2017 S3 outage happened, the issue wasn't reflected on the status page, and they later said they had trouble updating it because the status page relied on S3 somehow.

Maybe the 'optimistic' argument is more like the 'naive' argument...

I've seen this behavior at every place I've worked. You get in trouble when the status page goes red, so nobody says anything is down unless they absolutely have to.