by bloodyplonker22 1204 days ago
I have always wished that companies would tie their status pages to their monitoring systems, but alas, they don't due to PR concerns and overall disingenuousness.
3 comments

To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

In my opinion, what you want from statuspage depends on your product and how your customers are using it. Automation strikes me as plausible if your product is a well-defined set of APIs, but it’s borderline impossible for a complex web app with many functions, each of which may or may not be critical. SLO alerts can work here but presuppose you have good SLIs, and capturing good SLIs for a complex product is a whole lot of work that small companies won’t prioritize. All of this is to say, there are downsides to automation that have nothing to do with the biz-side bullshit you’ve correctly noted.
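
To make the SLO-alert idea concrete, here is a minimal Python sketch of a burn-rate check. It assumes a hypothetical availability SLI (good requests divided by total requests over some window) and a 99.9% target. The function names and the default threshold are illustrative assumptions, not anything from a particular monitoring product.

```python
def burn_rate(good: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly on pace;
    higher values mean it is being spent faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so no evidence of burn
    error_rate = 1 - good / total
    allowed_error_rate = 1 - slo_target
    return error_rate / allowed_error_rate


def should_alert(good: int, total: int,
                 slo_target: float = 0.999,
                 threshold: float = 14.4) -> bool:
    # The threshold is an assumption: a real setup would tune it per window.
    return burn_rate(good, total, slo_target) >= threshold
```

The hard part is not this arithmetic but deciding what counts as a "good" request for each function of a complex product, which is the SLI work described above.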

> To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

Preventing alert flapping is hardly "very highly polished"; it's the base level for having even internal monitoring. And if it works for you internally, you could definitely make some automatic information available.

It's not rocket science we're talking about. Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.
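
For illustration, a minimal sketch of that kind of two-state page with basic flap suppression. It assumes a single boolean health check polled at a fixed interval; the class name and the `fail_after` / `recover_after` parameters are hypothetical. The state only flips after several consecutive checks disagree with it, so a single flapping check does not produce an update.

```python
class DebouncedStatus:
    """Two-state status that changes only after a run of contrary checks."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5):
        self.fail_after = fail_after        # bad checks needed to go "not OK"
        self.recover_after = recover_after  # good checks needed to go "OK"
        self.healthy = True
        self._streak = 0  # consecutive checks disagreeing with current state

    def observe(self, check_ok: bool) -> bool:
        if check_ok == self.healthy:
            self._streak = 0  # agreement resets the streak
            return self.healthy
        self._streak += 1
        needed = self.fail_after if self.healthy else self.recover_after
        if self._streak >= needed:
            self.healthy = check_ok
            self._streak = 0
        return self.healthy

    def message(self) -> str:
        return "Everything OK" if self.healthy else "Something seems to be not OK"
```

With `fail_after=3`, the sequence OK, fail, OK, fail, fail leaves the page at "Everything OK"; a third consecutive failure flips it.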

> Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.

Any medium- to large-sized company will, at any given time, have dozens or hundreds of alerts going off. How do you know which ones should equal an "outage" on status.myco.com?

Edit: I just checked, and my company currently has 2079 monitors in an ALERT status. Some of those are probably real ongoing issues, maybe even some customer-facing issues... but there does not appear to be any reason to update our status page.

Thanks for replying. Your assertion that this is not rocket science leads me to believe that your experiences do not overlap with mine. Alert flapping is a trivial example of a service health indicator being misaligned with reality. If the indicators you work with are always comprehensive, accurate, and precise, that’s wonderful, but it’s not a plausible reality for complex systems [1].

[1] https://how.complexsystems.fail/#5

I don't have any PR concerns or disingenuousness; the reality is that this is an exceedingly hard task to actually pull off, with little or no payoff in the end.

For it to work, you have to know exactly which monitors could go into exactly which states (or ranges of values, etc.) such that it constitutes a user-facing outage that should be displayed publicly.

So you have to have automation around: How many users are affected? HOW are they affected? Is our homepage down but APIs still work? Is 1 API having issues but the other 100 just fine? Is there just some elevated latency but everything still works fine, or is it so elevated that we would call it an incident? Is it an internal tool that's broken? Is this a real alert, or just an incidental alert due to a planned failover or a deploy, where the metric loss is not indicative of user experience? Etc., etc.

It's truly an impossibly hard problem.
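
To show how much has to be known up front, here is a rough Python sketch of the rollup those questions imply. Every field and threshold is a hypothetical assumption. In particular, `customer_facing`, `affected_fraction`, and `in_maintenance` stand in for exactly the information that is hard to get right for thousands of monitors.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    alerting: bool
    customer_facing: bool         # assumption: labeled by hand, per monitor
    affected_fraction: float      # assumption: share of users hit, 0.0 to 1.0
    in_maintenance: bool = False  # planned failover or deploy in progress


def public_status(monitors, degraded_at=0.01, outage_at=0.25):
    """Roll alerting monitors up into one public status string."""
    relevant = [
        m for m in monitors
        if m.alerting and m.customer_facing and not m.in_maintenance
    ]
    # Use the worst single impact; internal-only and planned noise is ignored.
    worst = max((m.affected_fraction for m in relevant), default=0.0)
    if worst >= outage_at:
        return "major outage"
    if worst >= degraded_at:
        return "degraded"
    return "operational"
```

The function itself is trivial. The difficulty described above lives entirely in keeping those three inputs accurate for every monitor, all the time.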

Alas INDEED