by vinayan3 1198 days ago
Surprised to see how often Datadog and Metrist disagree about whether Datadog is down.

Anyone from Metrist able to explain this?

3 comments

Hi! Thanks for asking. Basically, status pages get updated manually, and people decide whether and when an outage is bad enough to warrant a status page update. We monitor actual functionality, so we capture smaller glitches that either escape human attention altogether or never get escalated to the point where the status page is updated.

In more detail, this can happen for three reasons:

1.) We use functional testing, so we're simply showing which aspects of the platform are working and which are not. Because of how "outages" are defined in SLAs, vendors like Datadog might not disclose or categorize certain dysfunctions as outages, so those won't show up on their status page. In other words, some outages are "minor" enough that the vendor leaves them off.

2.) Status pages are manual; Metrist is automatic. DD might not have updated its page yet, or might not even be fully aware of the outage. Our tests just show the objective data as it's happening.

3.) Everyone experiences outages differently. The data in the demo is Metrist's own experience with Datadog and can differ slightly from other people's (another reason why status pages can be vague). That's why we have an orchestrator that lets people set up personalized monitoring, so they know in real time exactly how a vendor is affecting them and whether an outage is relevant to them.
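To make the "functional testing" idea concrete, here is a minimal sketch of what a single functional check could look like. This is not Metrist's actual implementation: `run_probe`, `ProbeResult`, and the `slow_after_s` threshold are hypothetical names chosen for illustration, and the check itself is passed in as a plain callable so the example does not depend on any real vendor API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    name: str
    status: str              # "up", "degraded", or "down"
    latency_s: float
    error: Optional[str] = None


def run_probe(
    name: str,
    check: Callable[[], None],
    slow_after_s: float = 2.0,                     # assumed threshold, not a real default
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Run one functional check and classify the outcome.

    `check` should exercise a real user-facing action (for example,
    submitting a metric and reading it back) and raise on failure.
    """
    start = clock()
    try:
        check()
    except Exception as exc:
        # Any exception counts as the function being unavailable.
        return ProbeResult(name, "down", clock() - start, repr(exc))
    elapsed = clock() - start
    # A check that succeeds but is slow is reported as degraded.
    status = "degraded" if elapsed > slow_after_s else "up"
    return ProbeResult(name, status, elapsed)
```

A result like this depends only on whether the action worked from the caller's vantage point, which is why it can disagree with a manually curated status page in either direction.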

Does that answer your question? LMK if I can follow up with more info. :)

> Status pages get updated manually

This bugs me to no end. I don't want to name names but I had a devops service that was returning an odd error implying I was doing something wrong. Status page said everything was good. After several hours I emailed to be told it was actually down, they were aware, and were working on it. It eventually gets fixed, they email back, and all is well. The status page never did show any downtime.

Unless the status page shows response times and is automatically updated when stuff stops working, assume that the status page is used as a marketing page. Companies that have nothing to hide run proper status pages; the rest, who want to appear proper, run marketing status pages that take 30-60 minutes to even be updated in the first place.

Thanks for responding and providing details.

One follow-up: there are instances where Datadog reports an outage but Metrist says it's green.

Is that because the functional tests are still working but some other part of Datadog was reported as down?

In most cases, vendors like Datadog may keep manually reporting a service as down even when it's pretty much up and running, just to make sure they don't speak too soon about being back. But our tests can see that things are working before the vendor is ready to announce it. What a vendor reports usually isn't a real-time reflection of what's happening in their software.

Updating the status page is like a press release about someone important recovering from an illness. We're like the medical equipment that monitors that person's health. The press has to take some time to craft a message once they know the person is healthy, and they wait a moment before reporting so that the person doesn't relapse right after they announce a recovery. Medical equipment, on the other hand, is just there to measure health, and it can show recovery well before the press release.

In other cases, it's a matter of coverage: Metrist mostly monitors essential functions right now, and in the demo we monitor them from our own point of view. So a minor part we don't monitor could be down while the major parts we do monitor are up, and a status page may report a certain part of the service as down while we just don't monitor that part. Further, since users experience outages differently and the demo reflects our experience with the software, other users could be experiencing an outage while we aren't. So it's important for Metrist users to set up personalized monitoring so they know exactly how an outage is affecting them.
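The cases described above can be summarized as a small decision function. This is only an illustrative sketch: `explain_disagreement` and its arguments are hypothetical, and the returned strings paraphrase the explanations given in this thread rather than anything Metrist actually outputs.

```python
def explain_disagreement(
    vendor_status: str,          # what the status page says: "up" or "down"
    observed_status: str,        # what the functional check saw: "up" or "down"
    component_monitored: bool,   # does the monitor cover the affected component?
) -> str:
    """Return the most likely explanation when the two sources disagree."""
    if vendor_status == observed_status:
        return "agree"
    if vendor_status == "down" and observed_status == "up":
        if not component_monitored:
            return "affected component is not covered by the monitor"
        return "status page lags recovery, or outage not visible from this vantage point"
    if vendor_status == "up" and observed_status == "down":
        return "status page not updated, or outage only visible from this vantage point"
    return "unclassified"
```

For example, `explain_disagreement("down", "up", component_monitored=False)` points at a coverage gap rather than at a stale status page.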
My guess would be that Metrist made one or more API calls that failed within a time-slice (hopefully more than one failure). They then mark the entire day orange or red and compare it to AWS's green. Which is true, for the entire day their status symbol was probably green.
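To illustrate that guess, here is a minimal sketch of how per-call results might be rolled up into one colour per day. It is purely hypothetical: `daily_rollup` and the `red_fraction` threshold are made up for this example, and nothing in the thread confirms how Metrist actually aggregates.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_rollup(samples, red_fraction=0.05):
    """Collapse (timestamp, ok) samples into one colour per calendar day.

    green:  no failed checks that day
    orange: some failures, below the (assumed) red_fraction threshold
    red:    failures at or above red_fraction of that day's checks
    """
    counts = defaultdict(lambda: [0, 0])  # date -> [failures, total]
    for ts, ok in samples:
        bucket = counts[ts.date()]
        bucket[0] += 0 if ok else 1
        bucket[1] += 1

    colours = {}
    for day, (failed, total) in sorted(counts.items()):
        if failed == 0:
            colours[day] = "green"
        elif failed / total >= red_fraction:
            colours[day] = "red"
        else:
            colours[day] = "orange"
    return colours


# One check per minute for 100 minutes on each of three days.
day1 = datetime(2022, 1, 1)
day2 = datetime(2022, 1, 2)
day3 = datetime(2022, 1, 3)
samples = [(day1 + timedelta(minutes=i), True) for i in range(100)]
samples += [(day2 + timedelta(minutes=i), i != 0) for i in range(100)]   # 1 failure
samples += [(day3 + timedelta(minutes=i), i >= 10) for i in range(100)]  # 10 failures
colours = daily_rollup(samples)
```

Under a rule like this, a single failed call is enough to turn a whole day orange, even if a vendor's own availability target for that day was comfortably met.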

The AWS team has a hard challenge of reporting availability and deciding when a system is not green across dozens of API use cases per service, hundreds of services, hundreds of data centers, dozens of availability zones, and millions of clients.

Metrist has no visibility into a service's internal SLAs, SLOs, and SLIs. [1]

[1] https://cloud.google.com/blog/products/devops-sre/sre-fundam...

Metrist seems to consistently rate "downtime" differently from the services themselves, for better or worse.

Here are some examples where the SaaS says they are down/degraded, but Metrist thinks they're up:

https://app.metrist.io/demo/jira

https://app.metrist.io/demo/circleci

Here is another where Metrist thinks the service is down, but the service reports itself as up:

https://app.metrist.io/demo/newrelic

Thanks for pointing that out! Status pages are updated manually, whereas we monitor actual functionality. We often see services functionally recover long before their status pages say that everything is in working order. Again, that's because the update is manual, and status pages are often more for marketing than development purposes. Also, we're in "Show HN" and may not be 100% perfect ;) but we stick to the above explanation :)

That would explain the scenario where Metrist says something is down but the service itself doesn't say so, because its status page is manually updated.

But what about the reverse? In what scenario would the platform say something is down while Metrist says it's up? Metrist is fully automated as I understand it, so it should detect outages faster and more reliably than manually updated status pages, right?