Hacker News
by UglyToad 1889 days ago
So I've been mulling this stupid thought for a while (and disclaimer that it's extremely useful for these outage stories to make it to the front-page to help everyone who is getting paged with p1s out).

But, does it really matter?

I read people reacting strongly to these outages, suggesting that due diligence wasn't done to use a 3rd party for this or that. Or that a system engineered to reach anything less than 100% uptime is professional negligence.

However from the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better and why does a few hours of downtime ultimately matter?

I think it's partly living somewhere where a volcano the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA, systems should aim for reasonable uptime but at the end of the day the systems come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper perfection and that we should be prepared to do unhealthy shift patterns to avoid a moment of downtime in UTC - 11 or something.

My view is increasingly these outages are healthy since they force us to confront the fallibility of the systems we build and accept the chaos wins out in the end, even if just for a few hours.

21 comments

Yes and no, some things are actually time sensitive.

For example, I'm building a note-taking / knowledge base platform, and we were having some reliability issues last year when our platform and devops process was still a bit nascent. We had a user that was (predictably) using our platform to take notes / study for an exam, which was open book. On the day of her exam our servers went down and she was justifiably anxious that things wouldn't be back before it was time for her exam to start. Luckily I was able to stabilize everything before then and her exam went great in the end, but it might not have happened that way.

Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally, but I of course took this as a personal mission to improve our reliability so that our users never had to deal with this again. And since then I'm proud to say we've maintained 99.99% uptime[1]. So yes, there are definitely many situations where we can and should take a more laid back approach, but sometimes there are deadlines outside of your control and having a critical piece of software go offline exactly when you need it can be a terrible experience.

[1] https://status.supernotes.app/

> Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally

And they would be right. Having your notes pushed up to the cloud is great and I use a feature like that all the time (specifically with iCloud and either the Notes app or beorg), but the most recent version of these documents should always be available offline.

Is your application unavailable without a network connection? What if you go somewhere without reception?

Yep, for the moment it is unavailable without a connection. Luckily most people are connected all the time these days, so it hasn't actually been a sticking point for any of our users so far. But yes, we agree that having offline is also super important, so we're building that out as well.

We wanted to build a platform that had collaboration in mind from the beginning though, which is why we opted to go for online-only initially – kicked the tough engineering problem of eventual consistency (when collaborating) down the road a bit so that we could work on features that were actually unique to our system (it's just two of us at the moment).

> But, does it really matter?

This is a great line of thought, I'd encourage everyone to take it. There's a huge amount of crap people get up to that is mostly about performative debt balancing - people feel that they're owed something just because <fill in the blank>, when it really didn't matter. Just another gross aspect of a culture overly reliant on litigation for conflict management.

But the question is meaningless without qualifying: for whom?

Because I can absolutely imagine situations where an Auth0 outage could be extremely damaging, expensive, or both. Same for a lot of other services.

> Life doesn't have an SLA

Nope. Which is a part of the reason why people spend money on them for certain specific things. It is just another form of insurance against risk.

For a lot of stuff I agree but the problem is that (some of) these platforms advertise themselves as being built so that this should not happen. Less cynical engineers will then build some critical solutions that depend on these platforms and assume that they can and have successfully mitigated the risk of downtime. Sometimes the tools to manage/communicate/fix the service downtime are even dependent on the service being up.

The lesson is more that everything fails all of the time and the more interconnected and dependent we make things the more they fail. That is not something that can be solved with another SaaS as multiple downtimes, hacks, leaks and shutdowns have shown time and time again.

The point that often these services advertise on the basis of resiliency is a fair one and I agree with what I think I'm reading into your conclusion which, if correctly understood, is that by increasing the number of dependencies in our systems we're exposing ourselves to a compounding amount of downtime. And I'd assume we'd agree that generally we should architect towards fewer points of failure?

My reaction was more against the performative "haha, foolish n00b developers didn't build their system to use both Lambdas and Google Cloud and then failover to a data center on the North Pole like me, the superior genius that I am" that oftentimes appears in threads about downtime.

We could all do with a bit more "there but for the grace of god" attitude during these incidents while still learning lessons from them.

> And I'd assume we'd agree that generally we should architect towards fewer points of failure?

Yes, and to me that generally means having fewer points in total. We can make stuff pretty resilient but it's very hard and requires huge resources, so it's usually easier and simpler to just not have as many points at all instead of trying to add "more resilient" points in the form of SaaS.

In this case, a lot of apps are useless if the auth is down, and the auth is useless if the app is down so moving auth to something more resilient (if we assume this was an isolated incident and auth0 is generally good) only adds a point of failure and does not gain anything in terms of uptime. Especially since in more traditional setups the auth is usually hosted on the same server, on the same database and within the same framework as the app itself.

The problem is that the "small guy" is held to a high standard that the "big guy" isn't held to. If AWS shits itself for a day nothing will happen, if your small SaaS goes down for an hour you'll lose customers and people will yell at you.

And more importantly, if YOU try to use something "not big" and it goes down, it's on YOU - but if you're using Azure and it goes down, it's "what happens".

I think you're underestimating the scope of the impact and just how vital software is in the modern world. It's not just that people can't log in to a system, it's that they simply can't get their work done, and some of that work is really very time sensitive and important. Auth0 is depended on by hundreds of thousands of companies. Tens of millions of people will have been impacted by this outage today.

I think it's actually because I'm beginning to realise how much I used to believe in the importance of software and how maybe I no longer do.

For context I used to live in the UK which is probably, outside of South East Asia one of the most "online" societies (and miles ahead of the US in terms of things like online payments processing). I never carried cash, online orders for everything, etc.

I moved to Barbados towards the end of last year and let's just say there's a lot of low hanging fruit for software systems here. It takes about 4 months to get post from the UK, you can't really get anything from Amazon. There's a single cash machine that takes my card and sometimes it's out of money or broken and you can't open a bank account without getting a letter from your bank in the UK, with the aforementioned 4 month delay. Online banking doesn't exist. There was maybe 1 Deliveroo type service that was actually a front for credit card scamming and maybe 1 other food delivery app.

In a sense it has been so much more pleasant than life in the UK and not just because of the cheap beer and sunshine. If I have a problem I know my neighbours to speak to. I know the people in the bar, I know who can help me out if I ran out of money or needed food to tide me over.

This is all a bit 'trope of the noble savage', as if life was better off before all that technology or something. I don't believe that's the case however I also believe over-reliance/the belief in always-up systems reduces societal resilience. Certain things have to work, you have to be able to phone the ambulance and it comes (or alternatively know someone who could drive you to the hospital in a pinch), food has to get shipped in at some point, since a diet of cane sugar alone won't be sufficient. And for that supply chain technology, etc. is important. But there are many other types of software regarded as "vital" that I don't think are and the criteria for what is vital is actually a lot stricter than it can feel. And there's a lot more room for delay than we'd maybe feel when caught up in the tech bubble.

I appreciate this view, but I'm in academia, and with covid19 we are teaching remotely, doing exams remotely, etc. If the systems are down that can have a real disruptive effect on students not being able to submit homeworks/exams, us delivering lectures. And that potentially applies for the whole university (thousands of people).

To extend the OP's line of thinking: does it really matter? Exams can be rescheduled, extenuating circumstances taken into account. As someone that has fallen ill quite suddenly through examination periods due to chronic illness I never appreciated the dogmatic approach taken when administering tests. I'm a human being, things happen, systems go down...

What I meant is when whole systems go down (i.e. canvas, blackboard, office365 or similar) as opposed to the internet for one person, the amount of stress and extra work inflicted on thousands of people can (I think) be quite large. Sure, nobody died, it's nothing like that, it's just people get upset about it because it is something outside your control and affects many people.

> Exams can be rescheduled, extenuating circumstances taken into account. As someone that has fallen ill quite suddenly through examination periods due to chronic illness I never appreciated the dogmatic approach taken when administering tests. I'm a human being, things happen, systems go down...

The problem is that any "leeway" will be taken up by cheaters. And the cheaters far outnumber the people like you who genuinely need some slack.

When I taught, I tried not to be dogmatic. But people have to understand that when a prof gives leeway, he's putting his ass on the line ... he doesn't have authority to do that and he could get burned if someone gets riled up about it.

So, if your prof cuts you slack that you needed, keep it to yourself and STFU.

> And the cheaters far outnumber the people like you who genuinely need some slack.

Citation needed.

I suggest reading "Humankind" by Rutger Bregman, which is an interesting (and substantiated) counterargument to this idea.

> Citation needed.

My class.

Quote from student: "Thanks for having the best class."

Response from me: "Best?! You're getting clobbered in my class."

Quote from student: "Yeah, I'm not doing that well, but the bullshitters who always manage to butter up the Professor and skate through are actually failing for the first time ever. Everybody knows where they stand in your class. And, they know that if they put in the work they get the grade and if they don't, well, they get hammered."

Response from me: "Thanks, I guess?"

I considered it a compliment only because my father who taught high school for almost 4 decades said: "You're teaching a class. The students have to think you know your material, and they have to think you are fair. Nothing more. If they like you and/or respect you, so be it ... but those are non-goals. Your goal is to teach them the material, not be their friend."

That sounds great but it's second hand anecdata and says absolutely nothing about the ratio of cheaters:non-cheaters.

I understand your point. But, and forgive the vagueness and wooliness of my thoughts around this subject, does this not highlight the "software has made everything shit"-ness of academia? Wouldn't a little less software or a little more downtime be good here?

Instead of being able to make a judgement call or respond appropriately to changing circumstances; instead of being relied upon for your ability to judge the needs of your students accurately, you risk being flagged up for not sticking to protocol in matters of ~student~ consumer interaction.

If a cheater slips through, does it matter, that much, if the cheating is getting an extra few days of time to complete an assignment?

Aren't universities meant to be about expanding knowledge, places of learning? Aren't we making a mockery of the whole idea of tertiary education getting so caught up in catching 'consumers' gaming the system and the risk of debasing 'consumer currency points' or exam scores in order to justify the busywork of admin departments? Software and software enabled culture is incredibly powerful but it also removes human factors and discretion and has made many things worse.

> If the systems are down that can have a real disrupting effect on students not being able to submit homeworks/exams

Pre-COVID, schools shut down for other reasons, like snow days. Doesn't seem much different.

Not regarding this specific incident, but to reply to this:

> However from the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better and why does a few hours of downtime ultimately matter?

I've been mulling this for a while too, and I think I might have some responses that address your thought somewhat:

- Amazon/Google/Microsoft/etc. services have huge blast radii. If you build your own system independently, then of course you probably wouldn't achieve as high of an SLA, but from the standpoint of users, they (usually) still have alternative/independent services they can still use simultaneously. That decoupling can drastically reduce the negative impact on users, even if the individual uptimes are far worse than the global one.

- Sometimes it turns out problems were preventable, and only occurred because someone deliberately decided to bypass some procedures. These are always irritating regardless of the fact that nobody can reach 100% uptime. And I think sometimes people get annoyed because they feel there's a non-negligible chance this was the cause, rather than (say) a volcano.

- People really hate it when the big guys go down, too.
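A rough sketch of the blast-radius point in the first bullet, assuming that failures of independently hosted services are statistically independent (a simplification) and using made-up availability figures. The function names are hypothetical and only illustrate the arithmetic:

```python
from typing import Iterable

# Illustrative only: the availability figures below are assumptions,
# not measurements of any real provider.


def p_all_down_shared(provider_availability: float) -> float:
    """Every service sits on one provider, so they all fail together."""
    return 1.0 - provider_availability


def p_all_down_independent(availabilities: Iterable[float]) -> float:
    """Independently hosted services: all are down only if each one is down."""
    p = 1.0
    for a in availabilities:
        p *= 1.0 - a
    return p


shared = p_all_down_shared(0.9999)             # one "four nines" provider
separate = p_all_down_independent([0.99] * 3)  # three "two nines" services
print(f"all down (shared provider):     {shared:.6%}")
print(f"all down (independent hosting): {separate:.6%}")
```

Under these assumptions a user with three mediocre but independent options is much less likely to be locked out of everything at once, even though each individual service fails more often.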

I think, though I might be way off base, your comment surfaces something that drives a lot of the, to me, over-the-top reaction to these outages. And that's the way (for AWS/Azure/GCP/Cloudflare) they reveal how the big 3/4 actually have eaten the "old internet" and how obvious downtime makes it.

Like this isn't a space for hobbyists or people just doing things in a decentralized manner anymore. The joke from British TV Sitcom 'The IT Crowd' where (the bigwigs are sold the lie that) the internet is a blinking black box in the company offices is actually true. Like something goes wrong with some obscure autoscaling code and actually, the little black box did break the entire internet.

I'm the kind of person who hates AWS and wants to live in the woods eating squirrels, but I can't really begrudge them downtime.

> why does a few hours of downtime ultimately matter?

In our case (Azure downtime), because none of our customer systems would work.

This includes people on the road who need to do something every 5 minutes on their PDA (sometimes 100 people simultaneously in a big city).

So yes, it matters.

It doesn't matter though. In the end. What happens if that person doesn't do that very unspecific thing every 5 minutes on their PDA? Can they not complete their job still? Does the parcel not get delivered unless it is logged in the system the second it is delivered? Maybe so, maybe the driver steals it, taking advantage of the chaos of the system. Do they not go higher up the chain? Does the delivery company not have insurance? It can go endlessly but in the end. It doesn't even matter.

I happened to work in designing critical infrastructure for emergency services. We always had a failure in the plan, which is why part of our deliverable was a protocol for paper logging of the calls (ambulance, police, military...) and the subsequent following of the case. It worked amazingly when the system did go down. In part because it was roleplayed, in part because the system went down in a rather convenient time. The data was then added to the digital logs, and all was well in the world, including the people saved by the, and I kid you not, pen, and, paper... and other humans gasp

Yes it matters, since we can't do it later (only when a 3rd party is down can we do it later).

They can't complete their job and no, it can't be done later since the opportunity to execute it is time-sensitive. It's one of the things we optimize for.

In a country like France, there's a discussion specification for it and it would get a lot of hassle.

We aren't delivering packages....

E.g., one of the reasons it matters is that it would lose clients business and would be taken into account within 4 years (city tender...).

Just because "people don't die" doesn't mean it doesn't matter. A lot of jobs, cities and companies are dependent on what we do.

And jobs matter. So I think your statement is fundamentally flawed.

Out of interest, obviously you can't give too much away, what would happen if the users didn't/couldn't do that? The only situation that comes to mind is delivery drivers needing to get next destinations/mark deliveries completed but I'm maybe missing others.

I'm just hoping the people building the ambulance dispatch networks aren't using Azure :laughing:.

> I'm just hoping the people building the ambulance dispatch networks aren't using Azure :laughing:.

Hi, just happened to see your reply after I posted mine, and wanted to maybe give just a little bit of insight. Now, this might not be the case where you are from, but in my experience, ultimately, if all systems go down, there are protocols put in place for radio communication.

We always built tools taking into account existing protocols, so they can map 1:1 (you can imagine, you can't exclude any mission protocol because the product owner thinks the screen looks better without it) but also allow for the change of protocols. For all these services, it was the military structures that truly had the functionality core, which was mapped to what they could do without any technology in case of an emergency. Which is a damn lot.

So, I feel like I'm going a bit far here, but rest assured, the people building the ambulance dispatch networks probably build them on top of systems that work with the power off. So Azure going down, or not, it doesn't really matter.

Haha, thank you, I'm going to sleep a lot better at night now!

Your post was a really interesting insight into these systems, thank you.

In this case, the city misses income.

I'm allowed to speak about it, but I'd rather not in front of an online audience, just to be sure.

Even if that were true for a single system in isolation, it breaks down quickly as the number of services you’re ‘dependent’ on increases. Then that relatively rare downtime of 1% starts to grow until every day, ‘something’ is broken.
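To put rough numbers on that compounding, here is a minimal sketch. It assumes every dependency fails independently and that you need all of them up at once, which is a simplification; the function names are hypothetical:

```python
import math
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that every dependency is up at the same moment,
    assuming independent failures."""
    return math.prod(availabilities)


def dependencies_until_below(per_service: float, target: float) -> int:
    """Smallest number of equally reliable dependencies whose combined
    availability falls strictly below `target` (both must be in (0, 1))."""
    if not (0.0 < per_service < 1.0 and 0.0 < target < 1.0):
        raise ValueError("per_service and target must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_service)) + 1


for n in (1, 5, 10, 25, 50):
    print(n, "dependencies at 99% each ->", f"{combined_availability([0.99] * n):.1%}")

print("coin-flip point:", dependencies_until_below(0.99, 0.5), "dependencies")
```

With 1% downtime per dependency, the loop shows how quickly "all of them are up" stops being the normal state as the count grows.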
> Why are any of us going to do any better and why does a few hours of downtime ultimately matter?

The answer is surprisingly simple.

Most outages are the unintended result of someone doing something. When you are doing things yourself, you schedule the “doing something” for times when an outage would matter least.

If you are the kind of place where there is no such time, you mitigate. Backup systems, designing for resiliency, hiring someone else, etc.

I agree with you. Sometimes things break, such is life. What I don't fully understand is when people choose to outsource a critical part of their infrastructure and then complain when it happens to be down for a bit. It was a trade-off that was made.

> But, does it really matter?

I think an important consideration here is that a huge amount of time, money, and resources is spent on making sure the computers stay powered and cooled in all manner of situations. We contract redundant diesel delivery for generators, we buy and install gigantic diesel generator systems which are used for just minutes per year, huge automatic grid transfer switches, redundant fiber optic loops, dynamic routing protocols, N+1 this and double-redundant that. It's tremendously expensive in terms of money, human time, and physical/natural resources.

The point is that we are always striving to plan for failures, and engineering them out. When there is a real life actual outage, it means, necessarily, based on the huge amount of time and money and resources invested in planning around disaster/failure resilience, that the plan has a bug or an error.

Somebody had a responsibility (be it planning, engineering, or otherwise) that was not appropriately fulfilled.

Sure, they'll find it, and update their plan, and be able to respond better in the future - but the fundamental idea is that millions (billions?) have been spent in advance to prevent this from happening. That's not nothing.

I can definitely get on-board with this. When AWS or Azure has some outage they pull me into calls and ask me what to do. These vendors are so large it's like asking me for my advice on the weather. Everything is screwed, man. Just hunker down and go read a book or something.

I agree. I actually wrote something up about this back in 2015: https://www.rdegges.com/2015/obsessing-over-availability-is-...

This was a fantastic post, it covers a lot of the things I've been thinking about but in a comprehensible and readable way. I see it has been submitted here before but not gained much traction, do you mind if I submit it again?

I agree with this sentiment. Though there is of course a bit of a problem when you're dealing with people who don't.

I'd also highlight that when the big players go down people 'know' it's not your fault; when a small 3rd party provider goes down, taking part of your service with it, it's 'because you didn't do due diligence' or were trying to save a buck. Similar in a way to the adage 'no one got fired for buying IBM'.

> why does a few hours of downtime ultimately matter

I think people know this implicitly, but it's good to think about it explicitly. Whether downtime matters, and how much is acceptable, should be a question every system has decided on. Because ultimately uptime costs money, and many who are complaining about this outage are likely not paying anywhere near what it would cost to truly deliver 5+x9s or Space Shuttle level code quality.
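For a sense of what each extra nine buys, a small sketch of the yearly downtime allowance. It assumes a 365-day year and counts all downtime equally; the function name is made up for illustration:

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted per period at an availability of
    `nines` nines (e.g. 3 -> 99.9%, 5 -> 99.999%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)
    return period_days * 24 * 60 * unavailability


for n in range(2, 7):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:10.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional nine cuts the allowance by a factor of ten, which is roughly why the cost of delivering it climbs so steeply.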

That's a lovely viewpoint to be able to take about one's own priorities, but one that's hard to sell to the person at the entity, ultimately paying all your bills.

Yes, people should relax a bit, but those incidents you cite did cost those companies customers. That's okay for Amazon. But a small B2B service provider can't as easily absorb the loss.

> Just catch up on emails or something.

Hard to do when you can't authenticate to the email webapp.

We build these massively distributed, micro-concerned, mega-scaled systems, and at every step we recognize everything and anything can go wrong at any given moment, mulling over these problems on a daily basis.

And then it /does/ and all of us lose our shit haha.

All the sharding and YAML dark-arts in the world won't save us when the SSL cert renewal fails because the card has expired and the renewal reminder went into someone's spam.

This is a really interesting point that I hadn't considered before.

It's similar to ubiquitous next day delivery conditioning people to find anything longer unacceptable, when cheap next day is quite new and not even the norm yet.

No point in the post. I get horribly anxious when my food delivery takes just a little bit more than the estimated time, which is already in the 40 minute range, so pretty low. Then after I eat, I think about how spoiled I am by society, and how crazy it is that from the moment the impulse leaves my brain, it takes less than an hour for me to get whatever food I want...

Ah, a comment where I can put on my SRE (Site Reliability Engineering) hat :)

You're completely right that 100% availability is unreasonable and often never actually required, despite what a customer or site operator may believe.

Just a quick aside, availability (can an end user reach your thing) is often confused with uptime (is your thing up). If I operate a load balancer that your service sits behind and my load balancer dies, your service is up, but not available for those on the other side of said load balancer.

With that in mind, Hacker News could be theoretically up 100% of the time but if I go through a tunnel while scrolling Hacker News on my mobile phone, from my perspective, it is no longer 100% available, it is 100% - (period I was without signal) available, from my personal perspective as a user.
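The tunnel example can be written down as a tiny sketch. It assumes you only care about total seconds in an observation window, and that you know how much of the user's offline time overlapped with real service downtime; the names are hypothetical:

```python
def uptime(window_s: float, service_down_s: float = 0.0) -> float:
    """Fraction of the window in which the service itself was running."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return 1.0 - service_down_s / window_s


def perceived_availability(
    window_s: float,
    service_down_s: float = 0.0,
    user_offline_s: float = 0.0,
    overlap_s: float = 0.0,
) -> float:
    """Fraction of the window in which this particular user could reach
    the service. `overlap_s` is time counted in both other arguments."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    unreachable_s = service_down_s + user_offline_s - overlap_s
    return 1.0 - unreachable_s / window_s


# One hour of scrolling, 90 seconds of it spent in a tunnel, no server outage.
print(f"uptime:                 {uptime(3600):.2%}")
print(f"perceived availability: {perceived_availability(3600, user_offline_s=90):.2%}")
```

The service-side number stays perfect while the user-side number does not, which is the gap between uptime and availability described above.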

The point here is that a whole host of unreliable things happen in every day life from your router playing up to sharks biting the undersea cables.

With that in mind, you then want to go and figure out a reasonable level of service to provide to your end users (ask for their input!) that reflects reality.

It's worth noting too that Google (I don't love 'em but they pioneered the field) will actually intentionally disrupt services if they're "too available" so as to keep those downstream on their toes. It's not actually good for anyone if you have 100% availability in that they make too many assumptions and also, it's just good practice I suppose.

I can recommend reading the SLOs portion of the Google SRE book if you're curious to see more: https://sre.google/sre-book/service-level-objectives/

In short, an SLO is just an SLA without the legal part, so a guarantee of a certain level of service, often internally from one team to another.

Ideally these objectives reflect the level of service your customers (internal or external) expect from your service.

> Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region.

> Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.

> The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.

> In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
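The decision rule in that passage can be sketched as an error-budget check. This is my own reading of the quote, not Google's implementation: the 91-day quarter, the function names, and the choice to synthesize an outage only while some budget is left are all assumptions:

```python
QUARTER_SECONDS = 91 * 24 * 3600  # assumption: a quarter is roughly 91 days


def error_budget_seconds(slo_target: float, period_s: float = QUARTER_SECONDS) -> float:
    """Total downtime the SLO tolerates over the period."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * period_s


def remaining_budget_seconds(
    slo_target: float, downtime_s: float, period_s: float = QUARTER_SECONDS
) -> float:
    """Budget left after the real downtime observed so far."""
    return error_budget_seconds(slo_target, period_s) - downtime_s


def should_synthesize_outage(
    slo_target: float, downtime_s: float, period_s: float = QUARTER_SECONDS
) -> bool:
    """True when real failures have not used up the budget, i.e. the
    service is running better than its objective."""
    return remaining_budget_seconds(slo_target, downtime_s, period_s) > 0.0


# Hypothetical quarter: 99.9% target and 20 minutes of real downtime so far.
print(f"budget:    {error_budget_seconds(0.999):.0f} s")
print(f"remaining: {remaining_budget_seconds(0.999, 20 * 60):.0f} s")
print("synthesize outage?", should_synthesize_outage(0.999, 20 * 60))
```

A real system would also have to decide how long the controlled outage lasts so that it stays inside the remaining budget; that part is left out here.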