Hacker News
by poxrud 1429 days ago
The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry. It would be one thing if this was a Regional failure, but a single AZ failure should not have any noticeable effect.
17 comments

For most businesses a little down time here and there is a calculated risk versus more complex infrastructure. You can’t assume all the cloud architects are idiots — they have to report their task list and cost of infrastructure to someone who can give feedback on various options based on comparative resource requirements and risks.

Zone downtime still falls under an AWS SLA so you know about how much downtime to accept and for a lot of businesses that downtime is acceptable.

This. People working in IT naturally think keeping IT systems up 100% of the time is most important. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks.

- Is it worth spending 20% more on IT to keep our site up 99.99% vs 99%?

- Is it worth having 3 suppliers for every part that our business depends on, with each of them contracted to be able to supply 2x more in case another supplier has issues? And paying a big premium for that?

- Is it worth having offices across the globe, fully staffed and trained to be able to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?

I'm not saying that some of those outages aren't the result of clowny/incompetent design. But "site sometimes goes down" can often be a very valid option.
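For a sense of scale on the 99.99% vs 99% question above, here is a quick back-of-the-envelope sketch. It assumes a 365-day year and availability measured over the whole year; that is an assumption for illustration, not any provider's SLA definition.

```python
# Allowed downtime per year for a given availability target.
# Assumption: a 365-day year, availability measured over the full year.
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours/year")
```

Under those assumptions, 99% leaves room for roughly 87.6 hours of downtime a year, while 99.99% leaves under an hour. Whether closing that gap is worth the extra spend is the business question.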

I've had some interesting discussions about this with a bunch of representatives of our larger B2B customers. Interestingly enough, to them, a controlled downtime of 2-4 hours with almost guaranteed success is preferable to a more complex, probably working zero-downtime effort that might leave the system in a messed up - or not messed up - state.

To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second Saturday in October, call on Monday" to their customers 1-2 months in advance, prepare to have the critical information without our system, and have agents tell people just that.

This has really started to change my thoughts on how to approach, e.g., a major Postgres upgrade. In our case, it's probably better to just take a backup, shut down everything, do an offline upgrade, and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.

And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?

At least in my experience, AWS downtime also only accounts for a minor share of the total downtime; the major sources are crashes and bugs in the application you're actually trying to host. Being completely HA and independent of AZ crashes/bugs is extremely hard and time intensive and usually not worth it compared to investing that time to get your app to run smoothly.
Yes but when someone else causes your downtime it’s fun to sit around and snipe at them.
I think a good trade off, if your infra is in TF, is to be able to run your scripts with a parameterized AZ/region. That way you can reduce the downtime even more at a fraction of the cost. (assuming the services that are down are not the base layers of AWS, like the 2020 outage)
That works if you can get the data out of the downed AZ, don't have state you need to transfer, and are not shot in the foot once the primary replica comes online again. I've rarely deployed an app where it was as easy as just changing a region variable.
Yeah the data stores are the ones that I would always keep multi AZ no matter what. Everything else is stateless and can be moved quickly.
Write an article on that because you make it sound simple. Or better yet, start a company that configures this for companies.
Yeah, makes sense if explicitly stated. Not everything is worth the money.

However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.

The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.

This is true, but I think it would be more acceptable if the region were down vs the single AZ
Considering almost all of the services are multi-zone, it's not hard to add in a couple of lines to make them resilient against this.

People are just unaware, and probably making bad calls in the name of being "portable".

If your application and infra can magically utilize multiple zones with “a couple lines”… then I would say you are miles ahead of just about every other web company.
> you are miles ahead of just about every other web company.

I'm curious who these web companies are.

Use something like Lambda and you get multi-az for free.

https://docs.aws.amazon.com/lambda/latest/dg/security-resili...

Dynamo is another service that wouldn't be impacted as it is multi-az.

Getting postgres RDS multi-region would require the extra couple of lines in your CDK, but is fairly straightforward.

Today, a SaaS I’m familiar with that runs ~10 Aurora clusters in us-east-2 with 2-3 nodes each (1 writer, 1-2 readers) in different AZs had prolonged issues.

At least 1 cluster had a node on “affected” hardware (per AWS). Aurora failed to failover properly and the cluster ended up in a weird error state, requiring intervention from AWS. Could not write to the db at all. This took several hours to resolve.

All that to say that it’s never straightforward. In today’s event, it was pure luck of the draw as to whether a multi-AZ Aurora cluster was going to have >60 seconds of pain.

That SaaS has been running Aurora for years and has never experienced anything similar. I was very surprised when I heard the cluster was in a non-customer-fixable state and required manual intervention. I’ve shilled Aurora hard. Now I’m unsure.

Thank goodness they had an enterprise support deal or who knows if they’d still have issues now.

It's that easy for a lot of managed services.

Want GKE to run multi-zone, or Spanner to run multi-region, just check a box (and insert coin).

Or how about "I'm fully aware, I've done the math taking into account both cost and complexity of implementation and cost of downtime, and I'm probably making fantastic calls based on my actual needs."
If you had "done the math" then you would have gone serverless and gained multi-az for free, as it is almost always the cheapest option.
This has quickly grown to more than adding in a couple of lines! Now I need to architect my legacy app so that I can deploy into lambdas, then I can get resiliency I don't really need!

Not all systems require high availability. Some systems are A-OK with downtime. Sometimes, I'm perfectly fine with eventual consistency. You really do have to look at the use-cases and requirements before making sweeping statements.

I thought we were talking about cloud architects making poor decisions when designing solutions.

Where did legacy apps come from?

> Some systems are A-OK with downtime.

And those ones would not have cared about this outage. Your point isn't that clear.

Right, because magically serverless is the right answer for every application.
It gives me a bad gut feeling when you imply that multiple instances of a service are more complex than a single instance which cannot be duplicated easily.

I also disagree that it is inherently more costly to run a service in multiple locations.

Of course it's more costly, you need to ensure state between locations so by virtue there's more infra to pay for.

It's not just a single instance too, there's generally a lot more infrastructure (db servers, app servers, logging and monitoring backends, message queues, auth servers... etc)

Also, people who can configure and maintain that infrastructure. It is more complicated, and it does require a different sort of person.

(And checkbox-easy is sweeping edge cases and failure modes under the rug)

also inter region replication costs bandwidth money
Lots and lots of money.
How do you NOT pay more for running double of everything + load balancers?
You do not need to pay double for everything. That might have been true with traditional VPS providers, but it is not the way it works with cloud services. You decide on what kind of failure you're willing to tolerate and then architect based on those requirements (loss of multiple AZs, loss of a region, etc.).

Let's say your website requires 4 application servers. You can then tolerate a single AZ failure by using 5 application servers and spreading them among 5 AZs.
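The arithmetic behind the 4-server example can be sketched as below. This is a simplification that assumes identical, stateless servers spread evenly across AZs; the function names are made up for illustration, and an even spread is not always the cheapest layout.

```python
import math


def servers_for_one_az_loss(required: int, az_count: int) -> int:
    """Total servers, spread evenly over az_count AZs, so that losing any
    single AZ still leaves at least `required` servers running.

    Assumption: servers are identical and stateless.
    """
    if az_count < 2:
        raise ValueError("need at least 2 AZs to tolerate losing one")
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


def overhead_percent(required: int, az_count: int) -> float:
    """Extra servers over the bare requirement, as a percentage."""
    total = servers_for_one_az_loss(required, az_count)
    return 100 * (total - required) / required


if __name__ == "__main__":
    for required, azs in [(4, 5), (4, 3), (1, 2)]:
        total = servers_for_one_az_loss(required, azs)
        extra = overhead_percent(required, azs)
        print(f"need {required}, {azs} AZs -> run {total} (+{extra:.0f}%)")
```

With 4 required servers and 5 AZs this gives 5 servers (25% extra); with 3 AZs it gives 6 (50% extra). A service that needs only 1 server ends up running 2, which is where the "double the cost" intuition comes from. None of this covers stateful components such as databases.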

If you already have 4 application servers you are probably already AZ tolerant; most people concerned about "doubling everything" are only running 1 instance.

Going by your example, if your website requires 1 application server, tolerating a single AZ failure requires you to double the number of application servers.

Example - we have a service that used Kafka in the affected region that went down. Our primary kafka instance (R=3) survived but this auxiliary one failed and caused downtime. There's no way around this other than doubling the cost.

In most cases the elephant* in the room is your DB - it doesn't matter where your stateless application servers are, if your stateful DB goes down you're in trouble. It's also often 1) the hardest to replicate, as replication involves tradeoffs - see CAP theorem & co and 2) the most expensive, since it needs to be pretty beefy in terms of CPU, RAM and IO - all very expensive on AWS.

*: https://commons.wikimedia.org/wiki/File:Postgresql_elephant....

That's true, when only dealing with 1 server, you technically double the cost by adding a second server. My original comment was about "popular sites/services", that should be able to tolerate the costs and are most likely dealing with multiple servers.

For a single server deployment you can still reduce your downtime (with minimal costs) by having the ASG redeploy into another AZ on a failed health check.

Those stateless app servers are the easy part. But you need to be replicating the data, with all the cost and complexity decisions that come with it.
You should get into the database business. A lot of money to be made there if things are so trivial for you.
The sounds of crickets is deafening!
I’m sorry about your feelings but you are wrong.

It's more expensive to have more things, and it’s more expensive to have more complicated things that are also complex. And things that can fall over are inherently more complicated.

A multi-az deployment is a checkbox in most AWS services, e.g. ASGs, RDS, load balancers, etc. Someone didn't check that box because they didn't know about it, there isn't much complexity in it.
Aren't multi-az deployments more expensive? That would be a valid reason not to check this checkbox, if your business can survive a bit of downtime here and there.
Most of that expense is just the cost of a hot failover, but there is some additional cost around inter-AZ data transfer. If someone is not checking the boxes for cost reasons, I would be surprised if they had failovers in the same AZ. It seems more likely they just don't have failovers.
A checkbox that might 3-4x the cost.
Multi-AZ brings added complexity in terms of data duplication and consistency. If your app wasn't designed to handle those kinds of scenarios and experiences high user loads, then you are in for a lot of problems.

Designing for those scenarios increases complexity and cost and changes the architecture style, and most of the time it will bring you into microservices territory, where most companies lack experience and are just following best practices in a field where engineers are expensive and few.

RDS just has a button for multi-AZ primaries. No complexity or microservices.
OK, let's say you have a master in one AZ and it dies. What happens?
Automatic failover within 60 seconds.
> The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry.

Not really.

What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.

Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs with an HA database, you are supposedly covered against failures. Except that when it actually happened, it turned out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churning is now causing havoc with the pods that did survive.

Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen. At which time it's too late.
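The capacity assumption in the Kubernetes example above can at least be sanity-checked on paper before any chaos testing. A minimal sketch, assuming load is spread evenly across AZs beforehand and redistributes evenly across the survivors afterwards (real traffic rarely behaves that neatly):

```python
def utilization_after_az_loss(current_utilization: float, az_count: int) -> float:
    """Average utilization of the surviving AZs after one AZ disappears.

    Assumptions: load was evenly spread before the failure and is evenly
    redistributed after it. Utilization is a fraction (0.6 means 60%).
    """
    if az_count < 2:
        raise ValueError("need at least 2 AZs to lose one and keep serving")
    return current_utilization * az_count / (az_count - 1)


if __name__ == "__main__":
    for before in (0.50, 0.60, 0.70):
        after = utilization_after_az_loss(before, az_count=3)
        verdict = "ok" if after <= 1.0 else "over capacity"
        print(f"{before:.0%} across 3 AZs -> {after:.0%} across 2 ({verdict})")
```

Under these assumptions a 3-AZ fleet running at 70% is already over capacity once one AZ is gone. The sketch says nothing about rescheduling churn, retries, or saturated links, which is exactly the part that only shows up under real failure.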

Or, the redundancy actually causes a failure, so not only have you spent more money but you’ve reduced your availability doing so.

(Or worse, the redundancy causes a subtle failure like data loss.)

Nail on the head. The number of times I've seen way overcomplicated redundancy setups fail in weird and wonderful ways, causing way more downtime than a simpler setup would have, is pretty silly.
Don’t make the mistake of overromanticizing the simple solutions. They have nice, well understood failure conditions, and they come up relatively frequently.

When you start playing the HA game, the easy failures go off the table, and things break less often because “failures happen constantly and are auto-healed”. But when your virtual IP failover goes sideways or your cluster scheduler starts reaping systems because the metadata service is giving it useless data, you’re well into an infrequent, complex failure, and I hope you have a good ops team.

It’s always a trade off.

It's not so cut-and-dried. The AZ isolation guarantees are not quite at the maturity they need to be.

If you're using any managed services by AWS, you need to rely on their own services to be AZ fault-tolerant. In AWS speak, they may well be (just with elevated error rates for a few minutes while load balancing shifts traffic away from a bad AZ). But as an AWS customer, you still feel the impact. As an example, one of our CodePipelines failed the deployment step with an InternalError from CloudFormation. However, the actual underlying stack deployment succeeded. When we went to retry that stage, it wouldn't succeed because the changeset to apply is no more. It required pushing a dummy change to unblock that pipeline.

Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.

I have 2 takes on this:

1) AWS is already really expensive, just on a single AZ. Replicating to a second AZ would almost double your costs. I can't help but bring up the point that an old-school bare-metal setup on something like Hetzner/OVH/etc becomes significantly more cost-effective since you're not using AWS's advantages in this area anyway (and as we've seen in practice, AWS is nowhere near more reliable - how many times have AWS' AZs gone down vs the bare-metal HN server which only had its single significant outage very recently? - it makes sense considering the AWS control plane is orders of magnitude more complex than an old-school bare-metal server which just needs power and a network port).

2) It is extremely hard to keep systems reliable over time (during non-outage periods everything appears to work fine, even if you have accidentally introduced a hard dependency on a single AZ), and even harder to account for second-order effects such as an inter-AZ link suddenly becoming saturated during the outage. I'm personally not confident at all in Amazon's (or frankly, any public cloud provider's) ability to actually guarantee seamless failover during an outage, since the only way to prove it works is to have a real outage that induces those second-order effects. AWS or any other cloud provider isn't going to do that: an intentional, regularly scheduled outage for testing would hurt anyone who intentionally doesn't use multiple AZs, essentially pricing them out of the market by forcing them to either commit to the cost increase of multi-AZ or move to a provider that doesn't do scheduled outages for testing purposes.

Going bare-metal is a premature optimization. Most startups that go that route don't survive long enough to make use of this optimization.

Take advantage of AWS (or Azure, or DO) until you're big enough that bringing the action in-house is a financially and technically prudent option.

It’s premature when it’s premature. It’s late when it’s not.
As some others have alluded to, it seems common AWS services (the ones you rely on to manage multi-AZ traffic like ALBs and Route53) spike in error rate and nose dive in response time so it becomes difficult to fail things over. On top of that, services like RDS that run active hot standby then rely on those to fail over so it's difficult to get the DB to actually fail over.

I suspect, behind the scenes, AWS fails to absorb the massive influx in requests and network traffic as AZs shift around.

I would think regions with more AZs (like us-east-1) would handle an AZ failure better since there's more AZs to spread the load across

What's more surprising, imo, is the large apps like New Relic and Zoom that you'd expect to be resilient (multi region/cloud) taking a hit

Architect here. We had an outage and we have a very complete architecture. The issue is, the services were still reachable via internal health checks. So instead of taking the affected servers out of service, they stayed in.

We had to resolve it by manually shutting down all the servers in the affected AZ. Which is normally not needed.

There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well architected apps also.

Only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because again, all the health checks were fine).

Yup, exact same here. All of the multi-AZ failover depends on AWS recognizing that their AZ is having an issue, and they never reported having an issue on any health check, so no failover ever happened. We started being able to make progress when AWS told us which AZ was having issues. It still took some time for us to manually shift away from that AZ (manually promoting ElastiCache replicas to primary, switching RDS clusters around, etc.) because the AWS failover functionality did not work as it should have and we were relying on that. Multi-region failover would have made us more fault tolerant, but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region). Here's to hoping we never have a Route53 or global AWS API Gateway failure! Then even multi-region will not do us much good. Perhaps we should have some backup servers on the moon, then in case of nuclear warfare we can still be online via satellite.

P.S. AWS has said they have resolved the issue for almost 2 hours now and we are still having issues with us-east-2a.

Which internal health checks are you referring to?
Both the EC2 instance health and our HTTP health checks. If either of those failed the server would have been removed from the load balancer, but they didn't fail.

Only the external health checks that hit the system from an outside service were failing. And because those spread out the load across the AZs, only a fraction of them were failing and no good way to tell the pattern of failure.

I did have some Kubernetes pods become unhealthy but only because they relied on making calls to servers that were in a different AZ.

that tracks with our experience as well
It's always more complicated than just deploying EC2 instances into multiple AZs. Here are some things I noticed from today's events.

First: RDS. I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS[1]).

Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job at re-scheduling pods, but there were edge cases for sure. Mainly with Consul and Traefik running inside of the Kubernetes cluster. Finally, when the Kubernetes EC2 worker node came back up, nearly nothing got scheduled back to it. I had to manually re-deploy to get pod distribution even again. Though the last issue might be something I can improve on by using the new Kubernetes attribute topologySpreadConstraints[2].

[1] https://aws.amazon.com/premiumsupport/knowledge-center/rds-f... [2] https://kubernetes.io/docs/concepts/scheduling-eviction/topo...
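For reference, a zone-level topologySpreadConstraints entry looks roughly like the fragment below, written as a Python dict and dumped as JSON so it can be pasted into a manifest. The `app: web` label is a hypothetical placeholder, and `ScheduleAnyway` is just one of the two allowed `whenUnsatisfiable` values (the other is `DoNotSchedule`). As far as I understand, these constraints are applied when pods are scheduled, so on their own they would not move already-running pods back onto a node that returns.

```python
import json


def zone_spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """Build one topologySpreadConstraints entry that spreads pods across zones.

    `app_label` is a hypothetical label value; use whatever your pods carry.
    """
    return {
        "maxSkew": max_skew,  # max allowed difference in pod count between zones
        "topologyKey": "topology.kubernetes.io/zone",  # well-known zone label
        "whenUnsatisfiable": "ScheduleAnyway",  # soft constraint; "DoNotSchedule" is hard
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


# Fragment that belongs under a pod template's `spec`.
pod_spec_fragment = {
    "topologySpreadConstraints": [zone_spread_constraint("web")],
}

if __name__ == "__main__":
    print(json.dumps(pod_spec_fragment, indent=2))
```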

It's a game theory thing. If everyone stays single AZ, everyone goes down at the same time so nobody gets blamed. Somehow the blame falls on AWS instead!
I think you're confusing availability zones with regions in this comment.

AWS AZs don't even have consistent naming across AWS accounts.

That’s a feature, not a bug. If we all had the same number one, then things would not be loaded anything close to evenly. There is a command to find out what the unique ID is for your particular zones under your naming.
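A sketch of that lookup, assuming boto3 is installed and credentials are configured: the EC2 `describe_availability_zones` call returns both the per-account `ZoneName` and the account-independent `ZoneId`. The sample response below is made up for illustration; the real name-to-ID mapping differs from account to account.

```python
def zone_name_to_id(describe_response: dict) -> dict:
    """Map account-specific zone names to zone IDs from a
    describe_availability_zones response."""
    return {
        zone["ZoneName"]: zone["ZoneId"]
        for zone in describe_response["AvailabilityZones"]
    }


def fetch_zone_ids(region: str) -> dict:
    """Query AWS for the mapping. Needs boto3 (third-party) and credentials."""
    import boto3  # imported here so the pure helper above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return zone_name_to_id(ec2.describe_availability_zones())


# Hypothetical response trimmed to the two fields used here.
sample_response = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-2a", "ZoneId": "use2-az1"},
        {"ZoneName": "us-east-2b", "ZoneId": "use2-az2"},
    ]
}

if __name__ == "__main__":
    print(zone_name_to_id(sample_response))
```

Comparing zone IDs rather than names is what lets two accounts tell whether they are actually in the same physical zone.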
Clarification: 1/3 of sites will go down (those using the AZ that went offline), but my point is the same. Most companies aren't using multiple AZs, let alone multiple regions.
Best take lol
I don't think there's a shortage of people who can architect reliable services. I think companies simply put reliability on the back burner because it rarely bites them. It's the same reason technical debt is so rarely paid off.
> technical debt is so rarely paid off.

It's not debt if you don't have to pay for it -- and if the ongoing costs of whatever it is are relatively insignificant.

But technical debt bites you in every new feature by slowing new code addition.
There is a shortage of good cloud engineers, but even if there were more of them, the business doesn't give a crap about brief outages like this. Blame it on AWS and move on, business as usual. Even if they did care, the business is often too incompetent to understand that they could easily prevent these things. And even if they did realize it, they don't want to prioritize it over pushing out another half-baked feature, making sales, getting their bonus.
Multi-AZ architecture at least doubles the cost, and it tends to cost even more if the business is small. Good engineers find the balance between cost and availability.
No that is not correct, it is not double the cost, please see my reply above.
Salaries are a cost.
To an investor, a salary is a temporary cost, i.e., you pay the salary, get the TF scripts made, and fire the employee, while checkbox-driven, managed resiliency is going to cost you forever with no hope of ever eliminating that cost.

At least that’s what was recently told to me by my manager to explain why my employer prefers to hire people to self-manage the AWS infra.

Wait. Are you saying that while AWS maintains multiple AZs they can’t maintain reliability on the failover systems between them?
Did you, by chance, reply to the wrong comment? Don’t think I said anything about failovers etc.

The point made to me was that a devops role can be made to eventually automate their own job away to an extent. To an investor, having a devops role on staff is acceptable.

If you never had a devops role and used AWS managed services, you can’t automate that and trim costs.

I.e., devops roles look like surplus in the system if they’re doing a worse job than managed services but to certain audiences, that surplus is necessary. So, if you’re looking to fundraise and your business has tight margins, don’t be too hasty to move to managed services.

Replicating a huge database between AZs, let alone regions, can be an enormously expensive ongoing operational cost. Not everyone can afford it.
Assuming you're using RDS then multi-AZ deployment is just a simple configuration option. If you're using Aurora then it is handled automatically and is even less expensive.
don't all the multi-AZ deployments imply at least 1 standby replica in a different AZ?
Aurora can replicate the data but doesn't have to keep a hot standby AFAIUI. You can then start a new instance in a different az but the process is semi manual.
Yes, my point was that it is not complex to set up and maintain, but it is not free.
I can tell you from experience that the cloud architects are world class and it's actually the data techs that are the problem. Amazon doesn't value data center techs, they don't pay competitively and hire techs that barely have enough skill so they can pay them nothing. Then they metric the fuck out of the teams so that everyone focuses on quick fixes instead of taking the time to troubleshoot long-term persistent issues. Couple this with the fact that management is only concerned with creating new capacity instead of fixing existing capacity.
Will need to see the post-mortem. When us-east-1 had its last big outage, multiple AZs were working, but cross-AZ functionality (Lambda, EventBridge) was impacted... which made recovery problematic.
Not looked into it too closely yet, but for us it looks like there were also issues connecting between the two remaining AZs in our 3 node cluster.
We definitely had issues with all of the AZs in us-east-2, and far more services impacted than just EC2 (e.g. RDS and ElastiCache were intermittently down for us).
Both RDS and elasticache run on EC2. But both of them have Multi-AZ options.
sure, just saying that only EC2 instances were impacted is disingenuous at best.

all of our production services are multi-az as well

My take is that so many sites are broken, maybe I shouldn't care either. The extra complexity of dealing with high availability is something that probably isn't worth it for my project. Spend more time on features instead.
Companies don’t want to pay for in house architecture/etc and developers are generally ultra hostile towards ops people.