Hacker News
by eep_social 1201 days ago
To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

In my opinion, what you want from a status page depends on your product and how your customers use it. Automation strikes me as plausible if your product is a well-defined set of APIs, but it’s borderline impossible for a complex web app with many functions, each of which may or may not be critical. SLO alerts can work here, but they presuppose you have good SLIs, and capturing good SLIs for a complex product is a whole lot of work that small companies won’t prioritize. All of this is to say, there are downsides to automation that have nothing to do with the biz-side bullshit you’ve correctly noted.
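To make the SLO alert idea concrete, here is a minimal sketch of a burn-rate check, assuming you already have an SLI expressed as good/total request counts. The function names, the 99.9% target, and the 14.4 threshold are illustrative assumptions, not anything taken from a particular monitoring product.

```python
def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, nothing to burn
    error_rate = 1 - good / total
    error_budget = 1 - slo_target
    return error_rate / error_budget


def should_alert(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Alert only if both windows burn fast (hypothetical threshold).

    Each window is a (good, total) tuple. Requiring the short window too
    keeps an alert from lingering after the problem has already stopped.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

The hard part is still the one described above: deciding what counts as "good" for each function of a complex product. The arithmetic is the easy bit.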


> To be successful this would require a very highly polished set of monitors, otherwise common situations like alert flapping would end up spamming updates. Having a human in the loop increases latency but can also increase the quality of the communication.

Preventing alert flapping is hardly "very highly polished"; it's the baseline for even internal monitoring. And if it works for you internally, you could definitely make some of that information available automatically.

It's not rocket science we're talking about. Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.
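As a rough illustration of what that could look like, here is a minimal sketch that only flips the public status after several consecutive checks disagree with it, so a flapping monitor does not spam updates. The class name and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DebouncedStatus:
    """Flip the published status only after N consecutive disagreeing checks."""

    fail_threshold: int = 3  # hypothetical: failures in a row before "not OK"
    ok_threshold: int = 5    # hypothetical: passes in a row before "OK" again
    healthy: bool = True
    _streak: int = 0         # consecutive checks that disagree with `healthy`

    def observe(self, check_passed: bool) -> str:
        if check_passed == self.healthy:
            # Check agrees with what is published; a flap resets the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.ok_threshold if check_passed else self.fail_threshold
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        if self.healthy:
            return "Everything OK"
        return "Something seems to be not OK"
```

Recovery is deliberately slower than failure here, on the assumption that announcing "OK" too early is worse than announcing it a little late.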

> Having a status page that can automatically say "Everything OK" or "Something seems to be not OK" would even be enough, but most companies don't offer something like that.

Any medium to large sized company will, at any given time, have dozens or hundreds of alerts going off. How do you know which ones should equal an "outage" on status.myco.com?

Edit: I just checked, and my company currently has 2079 monitors in an ALERT status. Some of those are probably real ongoing issues, maybe even some customer-facing issues... but there does not appear to be any reason to update our status page.

Thanks for replying. Your assertion that this is not rocket science leads me to believe that your experiences do not overlap with mine. Alert flapping is a trivial example of a service health indicator being misaligned with reality. If the indicators you work with are always comprehensive, accurate, and precise, that’s wonderful, but it’s not a plausible reality for complex systems [1].

[1] https://how.complexsystems.fail/#5