by lngarner 1206 days ago
Hi! Thanks for asking. Basically, status pages get updated manually, and people decide whether and when an outage is bad enough to warrant a status page update. We monitor actual functionality, so we capture smaller glitches that either escape human attention altogether or never get escalated to the point where the status page is updated.

In more detail, this can happen for three reasons:

1.) We use functional testing, so we're simply showing which aspects of the platform are working and which aren't. Because of how "outages" are defined in SLAs, vendors like Datadog might not categorize or disclose certain dysfunctions as outages, so those won't show on their status page. In other words, some outages are considered too "minor" to include.

2.) Status pages are manual; Metrist is automatic. DD might not have updated its page yet, or might not even be fully aware of the outage. Our tests just show the objective data as it's happening.

3.) Everyone experiences outages differently. The data in the demo is Metrist's experience with Datadog, which can differ slightly from other people's (another reason why status pages can be vague). That's why we have an orchestrator that lets people set up personalized monitoring, so they can know in real time exactly how a vendor is affecting them and whether an outage is relevant to them.
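To make "functional testing" concrete, here is a minimal sketch of the idea, not Metrist's actual implementation. The step names (`submit_metric`, `read_metric`), the 5-second threshold, and the `up`/`degraded`/`down` labels are all assumptions for illustration. Each step is a callable that exercises one piece of vendor functionality and raises if it fails.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_functional_check(
    steps: List[Tuple[str, Callable[[], None]]],
    clock: Callable[[], float] = time.monotonic,
) -> List[StepResult]:
    """Run every step and record whether it worked and how long it took."""
    results = []
    for name, action in steps:
        start = clock()
        try:
            action()  # e.g. a real API call against the vendor
            results.append(StepResult(name, True, clock() - start))
        except Exception as exc:  # any failure counts as "not working"
            results.append(StepResult(name, False, clock() - start, str(exc)))
    return results


def summarize(results: List[StepResult], slow_after: float = 5.0) -> str:
    """Collapse step results into one status (threshold is hypothetical)."""
    if any(not r.ok for r in results):
        return "down"
    if any(r.seconds > slow_after for r in results):
        return "degraded"
    return "up"
```

In practice the steps would be real calls made with your own account and from your own region, which is why two customers running the same check can see different results.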

Does that answer your question? LMK if I can follow up with more info. :)

2 comments

> Status pages get updated manually

This bugs me to no end. I don't want to name names, but I had a devops service that was returning an odd error implying I was doing something wrong. The status page said everything was good. After several hours I emailed them, only to be told it was actually down, that they were aware, and that they were working on it. It eventually got fixed, they emailed back, and all was well. The status page never did show any downtime.

Unless the status page shows response times and is automatically updated when stuff stops working, assume that the status page is used as a marketing page. Companies that have nothing to hide run proper status pages; the rest, who want to appear proper, run marketing status pages that take 30-60 minutes to even be updated in the first place.

Thanks for responding and providing details.

One follow-up: there are instances where Datadog reports an outage but Metrist shows green.

Is that because the functional tests are still working but some other part of Datadog was reported as down?

In most cases, vendors like Datadog may keep reporting a service as down even when it's pretty much up and running again, just to make sure they don't announce recovery too soon. Our tests can see that things are working before the vendor is ready to announce it. What a vendor reports usually isn't a real-time reflection of what's happening in their software.

Updating the status page is like issuing a press release about someone important recovering from an illness. We're like the medical equipment that monitors that person's health. The press has to take time to craft a message once they know the person is healthy, and they wait a bit before reporting so they don't announce a recovery right before a relapse. Medical equipment, on the other hand, is just there to measure health, and it can show a recovery much sooner than the press release.

In other cases, the difference comes from coverage. Metrist mostly monitors essential functions right now, and in the demo we monitor them from our point of view. So a minor part we don't monitor could be down while the major parts we do monitor are up, and the status page may report that part of the service as down.

Further, since users experience outages differently and the demo reflects our experience with the software, other users could be experiencing an outage while we aren't. That's why it's important for Metrist users to set up personalized monitoring, so they know exactly how an outage is affecting them.
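The cases above can be lined up in a small sketch. This is only an illustration under assumed names: the component sets and the four labels are hypothetical, not anything Metrist or Datadog exposes.

```python
from typing import Dict, Set


def explain_mismatch(
    vendor_reports_down: Set[str],
    probes_failing: Set[str],
    probes_configured: Set[str],
) -> Dict[str, str]:
    """Label each component where the status page and the probes may disagree."""
    out: Dict[str, str] = {}
    for component in vendor_reports_down:
        if component not in probes_configured:
            # Status page says down, but we simply don't test this part.
            out[component] = "not_monitored"
        elif component in probes_failing:
            # Both sources agree.
            out[component] = "confirmed"
        else:
            # Vendor may not have announced recovery yet, or the problem
            # doesn't affect the place we are testing from.
            out[component] = "probe_passing"
    for component in probes_failing - vendor_reports_down:
        # We see failures the status page hasn't reported (yet, or ever).
        out[component] = "unreported"
    return out


result = explain_mismatch(
    vendor_reports_down={"logs", "apm", "dashboards"},
    probes_failing={"apm", "metrics"},
    probes_configured={"apm", "metrics", "logs"},
)
```

Here `logs` would come back as `probe_passing`, which is the "Datadog says down but Metrist shows green" situation from the question, and `dashboards` would come back as `not_monitored`.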