Hacker News
by null_content 2695 days ago
We don't tolerate houses collapsing out of nowhere, brakes failing over the course of normal usage and planes falling out of the sky during routine flights.

But for some reason, we HAVE TO tolerate software crapping itself once a year?

I don't accept this logic. This is just a sign of how sloppy the industry has become.

This is the reason your phone becomes obsolete after 2 years, whereas your car can continue to run after multiple decades of abuse.

6 comments

First: We’re not talking about “out of nowhere” or during “routine” operation. Doing better than 99.99% uptime implies robustness to even extreme, unusual situations.

Second: Air travel could be much, much cheaper if it didn’t have to be nearly 100% reliable. This would be the right trade-off to make in almost any application that doesn’t almost guarantee deaths when it fails.

We actually do tolerate it. Plenty of critical parts in your car are designed to not be 100% available even in all expected cases.

For example, plenty of higher-end cars in California come with summer tires that can't be used in cold weather or on ice.

Even the brakes you are talking about must be replaced every X miles (depending on how new the car is, this may be between 10k and 50k miles).

Houses are definitely not designed to be 100% available. This is in fact why they fail due to fire, earthquake, or other events. The design point is not instant failure, but it's also not "100% available".

Like the SRE book says, they make a tradeoff.

I think this is a false equivalency. If we're talking about "service unavailability", planes break all the time. Houses have to be vacated because of flooding, fire, insect infestation. Brakes do fail. Just like with software, we accept a certain level of risk in exchange for cost/convenience efficiencies (e.g. we don't want our planes to fall out of the sky, but we're okay with getting stranded in Phoenix for 24 hours because of a busted landing gear).
Also, brakes contribute to service unavailability. Brake pads need to be replaced on average every 50k miles, which takes the average driver 4 years. And let's say the average length of time your car is at the mechanic's to fix brakes is 3 days. That's 3 days of unavailability every 4 years, or 99.8% availability (two nines!), just because of brake pad replacements. Add in all the other required car maintenance and, depending on the reliability of the vehicle, you might be down into one-nine territory.
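
For anyone who wants to check the brake pad arithmetic, here is a minimal sketch. It uses only the figures from the comment (3 days in the shop every 4 years); the `availability` helper is a made-up name for illustration, and leap years are ignored.

```python
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def availability(downtime_days: float, period_days: float) -> float:
    """Fraction of the period during which the car is usable."""
    return 1 - downtime_days / period_days


# Figures from the comment: 3 days of brake work every 4 years.
brake_availability = availability(downtime_days=3, period_days=4 * DAYS_PER_YEAR)
print(f"Brake pad availability: {brake_availability:.1%}")
```

Swapping in a different downtime (say, a few hours instead of 3 days) shows how sensitive the result is to that one guess.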

Gmail going down is like your car being in the shop. It's not equivalent to a plane crashing; the equivalent there would be the entire contents and history of your Gmail account being unrecoverably deleted while you yourself had no backups. Of course, I'd still much rather have that happen a hundred times than be in one fatal plane crash.

Gmail seems to have 3 nines, although I couldn't find a better reference than this [1], where other services are included:

> [Google's infrastructure] delivers Gmail and other services to hundreds of millions of users with 99.978% availability and no scheduled downtime.

[1] https://support.google.com/googlecloud/answer/6056635?hl=en

PS. 99.978% availability translates to a total downtime of ~2 hours/year. Not bad! But it's when things break that we realize how performant and reliable they actually are.
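
The conversion from an availability percentage to yearly downtime can be sketched as below. The 99.978% figure is the one quoted above; the other percentages are only there for comparison, and `downtime_hours_per_year` is a hypothetical helper name. A 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assumption: leap years ignored


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# 99.978 is the quoted figure; the rest are common "nines" for comparison.
for pct in (99.0, 99.9, 99.978, 99.99):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```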

Edits: various typos.

I'm consistently amazed at how good Google and Facebook are at staying up. They're two services where I don't think I've ever really experienced a broad outage. Of course with Facebook's data designs, there's sometimes quirkiness as a result, but it's rarely completely off for me.

Google, I think I've only really noticed it offline once in the past 10 years or so. Not complaining at all.

OK, but it takes under 3 hours to change brake pads. So point taken, but the numbers are more moderate than presented.
Consider that airplanes are relatively self-contained systems, whereas most of the systems we deal with in networking cross many different independent boundaries, each of which can independently fail for any number of reasons. There are more parties involved in the regular operation of distributed software than in maintaining airplanes.
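
A rough sketch of why crossing many boundaries hurts: if a request needs every component in a chain to be up, and failures are independent, the availabilities multiply. Both of those are assumptions, and the component names and numbers below are invented purely for illustration.

```python
import math

# Hypothetical chain of independent dependencies, each at 99.99% availability.
components = {
    "dns": 0.9999,
    "isp": 0.9999,
    "load_balancer": 0.9999,
    "app_server": 0.9999,
    "database": 0.9999,
}

# Assumption: the service is up only when every component is up (a series chain).
combined = math.prod(components.values())
print(f"Combined availability: {combined:.4%}")
```

Five links at four nines each already land closer to three and a half nines overall, before any single party does anything unusually wrong.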
Which houses do you know of that are constantly being worked on, and that keep growing in size in perpetuity?

It's not a good metaphor, even if it looks intriguing at first sight.

We wouldn't tolerate those things if we all used the one plane, car and house. That's where this comparison falls down.
The e-mail equivalent of your house falling down is data loss. This is unavailability, which is more analogous to losing your keys and not being able to get in for 30 minutes.