by redleader55 885 days ago
In my view, a status page should have only one function: to tell your users whether your service is up, how long it has been down, and which parts are down, and maybe to list the work you are doing to fix the issues. Updating it should be automatic and as simple as possible, as part of the incident response process.

A status page should not replace your internal monitoring, so including "batteries" is both unnecessary and a bad idea, because of the next point.

A status page should not have dependencies, and if it does, they should have higher availability than your service. Otherwise, you need a status page for your status page. Node.js sounds like a liability in this case.


> A status page should not have dependencies, and if it does, they should have higher availability than your service.

IMHO this is often unnecessary. The critical thing is for the failure modes of your status page to be uncorrelated with the failure modes of your service, so that you're unlikely to break both at the same time. You might have, e.g., a public API with a 99.995% availability target and a status page with a 99.95% target. It of course depends on your situation, but those numbers wouldn't strike me as intrinsically wrong as long as the status page is properly independent of your service.
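
To put rough numbers on that, here is a minimal sketch of the arithmetic using the two availability targets quoted above. It assumes the failures are fully independent, which is the best case and rarely holds exactly in practice; the function and variable names are made up for illustration.

```python
# Availability targets taken from the comment above.
api_availability = 0.99995     # 99.995% target for the public API
status_availability = 0.9995   # 99.95% target for the status page

MINUTES_PER_YEAR = 365 * 24 * 60


def joint_outage_probability(a: float, b: float) -> float:
    """Probability that both systems are down at once.

    Assumption: the two systems fail independently of each other.
    """
    return (1.0 - a) * (1.0 - b)


p_both_down = joint_outage_probability(api_availability, status_availability)

# Expected downtime per year implied by each figure.
api_down_minutes = (1.0 - api_availability) * MINUTES_PER_YEAR
both_down_minutes = p_both_down * MINUTES_PER_YEAR

print(f"API down:  {api_down_minutes:.2f} min/year")
print(f"Both down: {both_down_minutes * 60:.2f} s/year")
```

Under that independence assumption, the API target allows roughly 26 minutes of downtime a year, and the expected overlap where the status page is down at the same time is under a second. Any shared dependency (same cloud region, same DNS, same deploy pipeline) makes the real overlap larger than this.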

It should be timely as well. Reddit's status page will happily show all green while Downdetector shows huge spikes in error reports for Reddit.

Which doesn't make for a super useful status page.

I don't really know what people expect from status pages. Having it change automatically based on metrics can result in inaccurate status. Having it behind a manual gate can also be inaccurate, since approvals and such take time.

So, what exactly is the expectation and how can you implement a perfect status system?

> Having it change automatically based on metrics can result in inaccurate status.

A status page's job is to inform users about potential issues. A user will seek out the status page specifically when they see issues on their end, but usually won't otherwise. Underreporting is therefore a huge issue, because you are essentially telling your users that the problem must be on their end even when it is not. Overreporting issues when there are none hurts no one, and chances are high that no user even sees it.
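
One way to encode that bias in an automatic updater is to make it quick to report trouble and slow to clear it. The sketch below is only an illustration of that idea: the class name, the 2% threshold, and the three-clean-checks rule are all assumptions, not anything from a real status page product.

```python
from dataclasses import dataclass


@dataclass
class StatusReporter:
    # Hypothetical values, chosen only for illustration.
    degrade_threshold: float = 0.02    # flag an issue at 2% errors
    clean_checks_to_recover: int = 3   # clean checks needed before going green
    status: str = "operational"
    _clean_streak: int = 0

    def observe(self, error_rate: float) -> str:
        """Update the public status from one measured error rate."""
        if error_rate >= self.degrade_threshold:
            # Err on the side of overreporting: one bad sample is enough.
            self.status = "degraded"
            self._clean_streak = 0
        elif self.status == "degraded":
            # Only clear the incident after several consecutive clean checks.
            self._clean_streak += 1
            if self._clean_streak >= self.clean_checks_to_recover:
                self.status = "operational"
                self._clean_streak = 0
        return self.status


reporter = StatusReporter()
history = [reporter.observe(rate) for rate in [0.0, 0.05, 0.0, 0.0, 0.0]]
print(history)
```

With these assumed settings, a single 5% error sample flips the page to "degraded", and it stays there for the next two clean checks before returning to "operational" on the third.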

Completely agree. I think the unofficial Steam Status[0] by xPaw[1] is a great example. I never go to the site unless I'm having issues. Between the service stats and the page views section it is really easy to confirm my suspicion that something is on fire at Valve. If it wasn't for this post I wouldn't have known that they had a minor connection issue a few hours ago.

[0]: https://steamstat.us/

[1]: https://xpaw.me/

This is actually a fair point. The batteries in this case are nice-to-haves, but to be fair, I personally would not use this for frontline infrastructure status reporting; more for client/customer-facing status feedback.

> Node.js sounds like a liability in this case.

that depends on what types of dependencies you're talking about

if you're talking about upstream servers/services, yep absolutely

but node.js dependencies (as in, libraries and packages), don't magically update by themselves. there's no reason node.js is a liability here unless you coordinate updating your service and status page dependencies at exactly the same time (which seems.. idiotic?)