Hacker News
by kichik 1209 days ago
Is there a website that tracks outages of other websites like Stack Overflow over years? I know some that tell you if it's down right now, but not over years.

I have a subjective feeling that Stack Overflow is down a lot more than other websites. I never see that mentioned in discussions of cloud vs on-prem, which makes those discussions seem incomplete.

2 comments

Seems to be testing from just one location, as far as I can tell?

Packets randomly time out on the internet, so I would take this dashboard with a grain of salt. We cannot be sure SO had an outage just because one request happened to fail.

On the other hand, if they had a 'down for maintenance' page up, pings would still work.
With the caveat that Pingdom will count a failed connection from a Pingdom server on the other side of the world to your server as downtime, even if the target, your ISP, and your ISP's upstream provider had no problems.
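
To make the single-location concern concrete, here is a minimal sketch of a quorum rule: only call it an outage when most probe locations fail, not when one request from one place times out. The `ProbeResult` type, the location labels, and the 50% threshold are all assumptions for illustration; this is not how Pingdom or the dashboard above actually decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    location: str  # hypothetical probe location label, e.g. "eu-west"
    ok: bool       # True if the check succeeded from that location


def is_outage(results: list[ProbeResult], min_failed_fraction: float = 0.5) -> bool:
    """Report an outage only if more than min_failed_fraction of locations failed."""
    if not results:
        return False  # no data is not evidence of an outage
    # One verdict per location: a location counts as up if any probe from it succeeded.
    up_by_location: dict[str, bool] = {}
    for r in results:
        up_by_location[r.location] = up_by_location.get(r.location, False) or r.ok
    failed = sum(1 for up in up_by_location.values() if not up)
    return failed / len(up_by_location) > min_failed_fraction
```

With this rule, a lone timeout from a far-away probe is ignored, while failures from most locations still register. It does nothing for the maintenance-page case, since a check that only tests reachability would report `ok=True` there.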
That’s an engineering choice, not cloud vs. on-prem. How many services are down when AWS us-east has a problem?
True. But cloud makes it a lot easier. In some cases it's built-in, like S3. In others it's a checkbox like RDS Multi-AZ. And if you need to roll your own, multi-AZ or even multi-region is much more straightforward than renting another rack somewhere.

I have personally seen Stack Overflow be "under maintenance" or straight up down a lot more than I have seen entire us-east-1 down.

Keep in mind that the "cloud" relies on an opaque control plane with undocumented failure modes (that sometimes even the provider does not know about).

Just because you tick a checkbox doesn't mean it'll actually work as planned, and unlike infrastructure within your control that you can actually test (pull the network or power cable from a live server if you need to), you can't simulate a cloud provider outage.

> multi-AZ or even multi-region is much more straightforward than renting another rack somewhere.

Assuming that enough of the AWS control plane is alive to actually allow you to log in and administer the services in your backup region.

Furthermore, cloud providers are their own businesses and are constantly in motion (introducing new features, etc.). That's good for their business but bad for yours, as it means they might be making risky changes that could affect you should they go wrong.

Exactly. I run a large enterprise service in a single datacenter with 5 years of 100% uptime. Our design goal is 99.97% measured monthly.
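
For a sense of scale, here is a quick sketch of what a 99.97% monthly target allows in downtime. The function name and the 30-day month are my assumptions; the comment above does not say how the month is defined.

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed in one month at the given availability target."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


# 99.97% over an assumed 30-day month leaves roughly 13 minutes of downtime.
print(round(downtime_budget_minutes(99.97), 2))
```

So the stated design goal tolerates about 13 minutes of downtime per month, and the reported 5 years at 100% is well inside that budget.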

We have that because we have complete control end to end. We made an engineering decision not to have geo-redundancy because many of the dependent services aren’t available that way either.

Because of the compute requirements, running that service in AWS or GCP would cost about 80% more, inclusive of all costs (equipment, labor, utilities, etc.).