by oooyay 992 days ago

First, congrats on the launch!

Why did you end up going with a SaaS model? 30 Euros or $31.50 USD is pretty expensive for something like a status site. You'd have a lot less to manage day to day and be able to focus more on innovating the product if you just sold the software, imo.

Why the focus on synthetic monitoring? As an SRE, I actively eschew synthetic monitoring. It's highly error-prone and doesn't actually indicate regional availability. I'd like a status site that I could push an internally derived SLA for a given service to, with the status site reflecting the average over time of that windowed SLA.
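A rough sketch of what that push model could look like on the status-site side. Everything here is hypothetical (the `PushedSla` class, its method names, and the choice of a plain unweighted mean), and it assumes samples arrive in timestamp order:

```python
from collections import deque
from statistics import fmean


class PushedSla:
    """Keeps availability samples pushed by the service owner (hypothetical API)."""

    def __init__(self, horizon_seconds: float):
        self.horizon_seconds = horizon_seconds
        self.samples = deque()  # (timestamp, availability in the range 0.0..1.0)

    def push(self, timestamp: float, availability: float) -> None:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0.0 and 1.0")
        self.samples.append((timestamp, availability))

    def displayed(self, now: float):
        # Drop samples that are older than the display horizon.
        while self.samples and self.samples[0][0] < now - self.horizon_seconds:
            self.samples.popleft()
        if not self.samples:
            return None  # nothing pushed recently, so the page should say "unknown"
        # Unweighted mean of the remaining windowed values.
        return fmean([value for _, value in self.samples])
```

A real implementation would probably weight each window by request volume, since a quiet window and a busy window should not count equally.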

SLAs are intended to incur customer refunds when they're violated, if they're meaningful. If your synthetic monitoring shows an SLA of 4 nines but it was actually closer to 4.8 or 4.9, then you could be on the hook for causing your customers a good bit of legal pain. Just something to think about in this space.

Other status sites don't build external SLAs off of internal metrics because deriving internal metrics that align with external outcomes is difficult. Instead, they calculate an SLA from the statuses posted over a period of time, e.g. Degraded, Down, Up. Supporting both modes could be a boon to potential customers.
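For the posted-status mode, the calculation is roughly a time-weighted average. This is only a sketch: the `StatusChange` type, the `availability` function, the assumption that the service was Up before the first posted change, and especially the 0.5 weight for Degraded are all illustrative choices, not how any particular status site does it:

```python
from dataclasses import dataclass

# Hypothetical policy: how much of a "Degraded" period counts as available.
WEIGHTS = {"Up": 1.0, "Degraded": 0.5, "Down": 0.0}


@dataclass
class StatusChange:
    at: float    # time the status was posted, in seconds
    status: str  # one of the keys in WEIGHTS


def availability(changes, window_start, window_end, initial="Up"):
    """Time-weighted availability over [window_start, window_end)."""
    # Assume the `initial` status held until the first posted change.
    ordered = [StatusChange(float("-inf"), initial)]
    ordered += sorted(changes, key=lambda change: change.at)
    available = 0.0
    for i, change in enumerate(ordered):
        # Each status lasts until the next change, clipped to the window.
        next_at = ordered[i + 1].at if i + 1 < len(ordered) else window_end
        start = max(change.at, window_start)
        end = min(next_at, window_end)
        if end > start:
            available += WEIGHTS[change.status] * (end - start)
    return available / (window_end - window_start)
```

With a 100-second window containing 10 seconds of Down and 20 seconds of Degraded, this reports 0.8 under the weights above.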

Overall looks like a great start; good luck on your venture!

4 comments

lucgagan 992 days ago

> Why the focus on synthetic monitoring? As an SRE, I actively eschew synthetic monitoring. It's highly error-prone and doesn't actually indicate regional availability. I'd like a status site that I could push an internally derived SLA for a given service to, with the status site reflecting the average over time of that windowed SLA.

As an end user, hard disagree.

GitHub is a great example of this. Their status almost always shows 100% uptime while the service is entirely unstable.

It is clear that their uptime SLAs do not align with end user experience.

As an end user, I care whether I can access and use the service. I don't care what broke in between.

oooyay 992 days ago

I suspect that on GitHub's end this has to do with how they populate their status site. They may update it manually once they identify customer impact. If they're using internal metrics to drive the status site, then they're likely not using all of the metrics needed to reflect customer impact. There's also a third possibility: something between you and GitHub causes a partition or failure that is outside both GitHub's and your domain of control.

I agree with you that the ultimate value is in customer impact. I was saying "that's hard", but synthetic monitoring is not the solution because it doesn't achieve what it sounds like it achieves.

101008 992 days ago

I don't know much about status pages, I just check them to see if the services I use are having an issue. It's the first time I've read about "synthetic monitoring", and from a quick Google search, it seems to refer to "automatic monitoring". A basic version of this would be to ping the server to see if it's responding, or to send an HTTP request to see if it returns a 200 status code.
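As a minimal sketch of that basic version, here is an HTTP check using only Python's standard library. The function name `http_check` and the injectable `opener` parameter are made up for illustration (the latter just makes it testable without a network), and a real monitor would run this on a schedule from one or more regions:

```python
import urllib.request


def http_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError (raised for 4xx/5xx) and socket
        # timeouts are all OSError subclasses, so they all count as "down".
        return False
```

Calling `http_check("https://example.com/")` returns True or False depending on what the probe sees from wherever it runs, which is exactly the limitation being debated in this thread.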

However, if I read your comment carefully, you are suggesting an alternative where the company (owner) could decide manually when a system is down or up. If that's the case, wouldn't the status page be just a page template where someone logs into a panel, toggles a button to say "down" or "up", and posts updates? If there is no automatic monitoring, the service would look more like a blog/tumblr/twitter than anything else.

Or probably I am missing something because of my lack of experience and I am curious, I'd like to know!

oooyay 992 days ago

Good question. Status sites usually advertise the availability of features. When your service-to-feature mapping is 1:1 with just a load balancer or a cache in between, then it's relatively simple to calculate. The number of 500s on the load balancer, cache, or both indicates errors sent to users. As a company grows, several services usually combine to form a single feature; think about how a company has a "sign in" feature. There's likely a service that handles typical username/password auth, then one for SSO, one for passkeys, etc. At that point you have several inputs, but the outputs remain somewhat consistent: 500s seen on your most externally facing endpoints are errors to users.

Now combine all of the above with a client that has retry capabilities. That client could be a modern web app or a desktop app. Eventually consistent systems often rely on retry behavior and rate limiting to achieve smooth user transitions. Now I can't simply rely on 500s being sent, because they may indicate a timeout or a caching problem. Instead I need to rely on statistics for specific endpoints that will definitely result in a user-facing error. Collecting that in real time (real-time enough for alerting, anyway) is challenging, as a company at that scale could be dealing with an abundance of requests per second.

When SREs get into an incident they'll often try to determine customer impact in order to know what hemorrhaging to stop first. Looking at a list of 500s in a system like that is often unhelpful, so we'll build dashboards of specific endpoints that show a level of degradation, e.g. "Show me all requests that did not have 2xx where the number of retries is 3". In my contrived example the client shows an error after the third exponential retry. If you were calculating availability purely off of the number of 500s, you're not actually calculating customer impact, you're calculating the number of errors. That said, it's a lot easier said than done to build a data system that can answer a query like the one I described, much less to export it. So in order to provide accurate information, the status site is updated manually.
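To make the contrived example concrete, here is a toy version of that query over in-memory records. The `RequestLog` shape and its `retries` field are hypothetical; in practice this would be a query against whatever logging or metrics pipeline you have, which is the hard part:

```python
from dataclasses import dataclass

MAX_RETRIES = 3  # in the contrived example, the client shows an error after the third retry


@dataclass
class RequestLog:
    endpoint: str
    status: int
    retries: int  # how many retries preceded this attempt (hypothetical field)


def raw_error_count(logs):
    """Every 5xx response, whether or not a user ever saw an error."""
    return sum(1 for log in logs if log.status >= 500)


def user_facing_failures(logs):
    """Non-2xx responses on the final retry, i.e. errors the client actually surfaced."""
    return [
        log for log in logs
        if not 200 <= log.status < 300 and log.retries == MAX_RETRIES
    ]


logs = [
    RequestLog("/login", 500, 0),
    RequestLog("/login", 500, 1),
    RequestLog("/login", 200, 2),  # recovered on retry, user saw nothing
    RequestLog("/pay", 503, 0),
    RequestLog("/pay", 503, 1),
    RequestLog("/pay", 503, 2),
    RequestLog("/pay", 504, 3),    # retries exhausted, user saw an error
]
```

Here `raw_error_count(logs)` counts six server errors, while `user_facing_failures(logs)` contains only the single `/pay` request that exhausted its retries.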

On the flip side of what you described, some errors don't have a statistic. For instance, if I force rotate everyone's password and kill logins then I might post that on the status site as well. If it's the result of a security action or vulnerability I might declare the service degraded for a period of time.

101008 992 days ago

Thank you very very much for taking the time to write this explanation. I learnt a lot today :)

tibozaurus 992 days ago

Thanks again!

Tbh we haven't thought about SLA violations.

For regional availability, we are planning to add multi-region checks per monitor.

At the moment you can only set one region per monitor.

paulddraper 991 days ago

> Why did you end up going with a SaaS model?

Convenience.

More companies want to use Datadog, etc. than to manage Datadog, etc.
