Hacker News
by dangero 1429 days ago
For most businesses, a little downtime here and there is a calculated risk weighed against more complex infrastructure. You can’t assume all the cloud architects are idiots; they have to report their task list and cost of infrastructure to someone who can give feedback on various options based on comparative resource requirements and risks.

Zone downtime still falls under an AWS SLA, so you know roughly how much downtime to expect, and for a lot of businesses that downtime is acceptable.

6 comments

This. People working in IT naturally think keeping IT systems up 100% of the time is most important. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks.

- Is it worth spending 20% more on IT to keep our site up 99.99% of the time vs 99%?

- Is it worth having 3 suppliers for every part that our business depends on, each contracted to be able to supply 2x more in case another supplier has issues? And paying a big premium for that?

- Is it worth having offices across the globe, fully staffed and trained to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?

I'm not saying that some of those outages aren't the result of clowny/incompetent design. But "site sometimes goes down" can often be a very valid option.
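
To put numbers on the 99% vs 99.99% question, here is a rough sketch of the yearly downtime each target allows. It assumes a 365-day year and treats availability as a simple fraction of wall-clock time; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% uptime allows {hours:.2f} hours ({hours * 60:.0f} minutes) of downtime per year")
```

So 99% is roughly 87.6 hours a year, while 99.99% is a bit under an hour. Whether closing that gap is worth the extra spend is exactly the business question.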

I've had some interesting discussions about this with a bunch of representatives of our larger B2B customers. Interestingly enough, to them, a controlled downtime of 2-4 hours with almost guaranteed success is preferable to a more complex, probably working zero-downtime effort that might or might not leave the system in a messed-up state.

To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second Saturday in October, call on Monday" to their customers 1-2 months in advance, prepare to have the critical information available without our system, and have agents tell people just that.

This has really started to change my thinking on how to approach, e.g., a major Postgres upgrade. In our case, it's probably better to just take a backup, shut everything down, do an offline upgrade, and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.

And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?

At least in my experience, AWS downtime also accounts for only a minor share of total downtime; the major sources are crashes and bugs in the application you're actually trying to host. Being completely HA and independent of AZ crashes/bugs is extremely hard and time-intensive, and usually not worth it compared to investing that time in getting your app to run smoothly.
Yes, but when someone else causes your downtime it’s fun to sit around and snipe at them.
I think a good trade-off, if your infra is in TF (Terraform), is to be able to run your scripts with a parameterized AZ/region. That way you can reduce the downtime even more at a fraction of the cost. (Assuming the services that are down are not the base layers of AWS, like in the 2020 outage.)
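A minimal sketch of what "parameterized AZ/region" could look like from a wrapper script, assuming the Terraform configuration declares input variables named `region` and `availability_zone` (those names, and the helper functions, are hypothetical). It only builds the command by default and runs it when `dry_run` is turned off.

```python
import subprocess


def terraform_apply_command(region: str, availability_zone: str) -> list[str]:
    """Build a `terraform apply` invocation with region/AZ passed as input variables.

    The variable names `region` and `availability_zone` are assumptions;
    they must match whatever the configuration actually declares.
    """
    return [
        "terraform",
        "apply",
        f"-var=region={region}",
        f"-var=availability_zone={availability_zone}",
    ]


def redeploy(region: str, availability_zone: str, dry_run: bool = True) -> list[str]:
    """Return the command, and execute it only when dry_run is False."""
    command = terraform_apply_command(region, availability_zone)
    if not dry_run:
        subprocess.run(command, check=True)  # raises if terraform exits non-zero
    return command


# Dry run: show what would be executed to move the stack to another AZ.
print(" ".join(redeploy("us-east-2", "us-east-2b")))
```

This covers only the stateless parts; anything holding data needs its own plan.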
That only works if you can get the data out of the downed AZ, don't have state you need to transfer, and aren't shot in the foot once the primary replica comes back online. I've rarely deployed an app where it was as easy as just changing a region variable.
Yeah the data stores are the ones that I would always keep multi AZ no matter what. Everything else is stateless and can be moved quickly.
Write an article on that because you make it sound simple. Or better yet, start a company that configures this for companies.
Nothing is inherently simple.

Depending on the size of the company it can be simple or hard. Most companies that need this are not huge. Things like RDS, ElastiCache, ECR and Secrets have multi-AZ support built in, so it's not hard to do. If you operate on ECS or EKS it's pretty straightforward to boot up nodes and load balancers in another AZ.

Maybe you have a system that requires more hands on work and want to explain your point of view? I don't appreciate the snarky responses tho.

Yeah, makes sense if explicitly stated. Not everything is worth the money.

However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.

The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.

This is true, but I think it would be more acceptable if the whole region were down rather than a single AZ.
Considering almost all of the services are multi-zone, it's not hard to add in a couple of lines to make them resilient against this.

People are just unaware, and probably making bad calls in the name of being "portable".

If your application and infra can magically utilize multiple zones with “a couple lines”… then I would say you are miles ahead of just about every other web company.
> you are miles ahead of just about every other web company.

I'm curious who these web companies are.

Use something like Lambda and you get multi-az for free.

https://docs.aws.amazon.com/lambda/latest/dg/security-resili...

Dynamo is another service that wouldn't be impacted as it is multi-az.

Getting postgres RDS multi-region would require the extra couple of lines in your CDK, but is fairly straightforward.

Today, a SaaS I’m familiar with that runs ~10 Aurora clusters in us-east-2 with 2-3 nodes each (1 writer, 1-2 readers) in different AZs had prolonged issues.

At least 1 cluster had a node on “affected” hardware (per AWS). Aurora failed to fail over properly and the cluster ended up in a weird error state, requiring intervention from AWS. We could not write to the db at all. This took several hours to resolve.

All that to say that it’s never straightforward. In today’s event, it was pure luck of the draw as to whether a multi-AZ Aurora cluster was going to have >60 seconds of pain.

That SaaS has been running Aurora for years and has never experienced anything similar. I was very surprised when I heard the cluster was in a non-customer-fixable state and required manual intervention. I’ve shilled Aurora hard. Now I’m unsure.

Thank goodness they had an enterprise support deal or who knows if they’d still have issues now.

It's that easy for a lot of managed services.

Want GKE to run multi-zone, or Spanner to run multi-region, just check a box (and insert coin).

Or how about "I'm fully aware, I've done the math taking into account both cost and complexity of implementation and cost of downtime, and I'm probably making fantastic calls based on my actual needs."
If you had "done the math" then you would have gone serverless and gained multi-az for free, as it is almost always the cheapest option.
This has quickly grown to more than adding in a couple of lines! Now I need to architect my legacy app so that I can deploy into lambdas, then I can get resiliency I don't really need!

Not all systems require high availability. Some systems are A-OK with downtime. Sometimes, I'm perfectly fine with eventual consistency. You really do have to look at the use-cases and requirements before making sweeping statements.

I thought we were talking about cloud architects making poor decisions when designing solutions.

Where did legacy apps come from?

> Some systems are A-OK with downtime.

And those ones would not have cared about this outage. Your point isn't that clear.

No, we were talking about architects making decisions that you characterised as poor. I was pointing out that your statement was over-general and that there are many instances where making the informed decision to ignore HA is a completely reasonable thing to do.

By your last sentence, it appears you agree with me.

If you meant to say that your statement only applies to cloud architects who are attempting to maintain an uptime SLA with multi-az/region redundancy, then sure, AWS has lots of levers you can pull and those complaining really should spend some time studying them.

As for legacy applications, I would not have brought them up at all if you hadn't suggested pushing things into lambdas as a solution to multi-az. Once again, there are many, many situations where this is not appropriate. Not everything is greenfield, and re-architecting existing applications in an attempt to shoehorn them into a different deployment model seems a bit much. Unless I'm misunderstanding what you meant.

Right, because magically serverless is the right answer for every application.
It gives me a bad gut feeling when you imply that multiple instances of a service is more complex than a single instance which cannot be duplicated easily.

I also disagree that it is inherently more costly to run a service in multiple locations.

Of course it's more costly, you need to ensure state between locations so by virtue there's more infra to pay for.

It's not just a single instance either; there's generally a lot more infrastructure (db servers, app servers, logging and monitoring backends, message queues, auth servers... etc.)

Also, people who can configure and maintain that infrastructure. It is more complicated, and it does require a different sort of person.

(And checkbox-easy is sweeping edge cases and failure modes under the rug)

Also, inter-region replication costs bandwidth money.
Lots and lots of money.
How do you NOT pay more for running double of everything + load balancers?
You do not need to pay double for everything. That might have been true with traditional VPS providers, but it is not the way it works with cloud services. You decide what kind of failure you're willing to tolerate and then architect based on those requirements (loss of multiple AZs, loss of a region, etc.).

Let's say your website requires 4 application servers; you can then tolerate a single AZ failure by using 5 application servers and spreading them among 5 AZs.

If you already have 4 application servers you are probably already AZ tolerant; most people concerned about "doubling everything" are only running 1 instance.
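
The arithmetic behind the 4-servers-across-5-AZs example can be sketched like this. It assumes servers are stateless, interchangeable, and spread evenly across AZs; the function name is made up for illustration.

```python
import math


def servers_for_az_tolerance(required: int, az_count: int, az_failures: int = 1) -> int:
    """Total servers needed so `required` are still running after losing `az_failures` AZs.

    Assumes an even spread across AZs and interchangeable (stateless) servers.
    """
    surviving_azs = az_count - az_failures
    if surviving_azs < 1:
        raise ValueError("need at least one surviving AZ")
    per_az = math.ceil(required / surviving_azs)  # each surviving AZ must carry its share
    return per_az * az_count


for required, azs in [(4, 5), (4, 3), (4, 2), (1, 2)]:
    total = servers_for_az_tolerance(required, azs)
    overhead = (total - required) / required * 100
    print(f"{required} needed across {azs} AZs: run {total} servers ({overhead:.0f}% overhead)")
```

Under these assumptions the overhead shrinks as the server count and AZ count grow: 25% for 4 servers over 5 AZs, but 100% when a single server has to survive an AZ loss.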

Going by your example, if your website requires 1 application server, tolerating a single AZ failure requires you to double the number of application servers.

Example: we have a service that used Kafka in the affected region that went down. Our primary Kafka instance (R=3) survived, but the auxiliary one failed and caused downtime. There's no way around this other than doubling the cost.

In most cases the elephant* in the room is your DB - it doesn't matter where your stateless application servers are, if your stateful DB goes down you're in trouble. It's also often 1) the hardest to replicate, as replication involves tradeoffs - see CAP theorem & co and 2) the most expensive, since it needs to be pretty beefy in terms of CPU, RAM and IO - all very expensive on AWS.

*: https://commons.wikimedia.org/wiki/File:Postgresql_elephant....

That's true, when only dealing with 1 server, you technically double the cost by adding a second server. My original comment was about "popular sites/services", that should be able to tolerate the costs and are most likely dealing with multiple servers.

For a single server deployment you can still reduce your downtime (with minimal costs) by having the ASG redeploy into another AZ on a failed health check.

Those stateless app servers are the easy part. But you need to be replicating the data, with all the cost and complexity decisions that come with it.
You should get into the database business. A lot of money to be made there if things are so trivial for you.
The sound of crickets is deafening!
I’m sorry about your feelings but you are wrong.

It’s more expensive to have more things, and it’s more expensive to have more complicated things that are also complex. And things that can fall over are inherently more complicated.

A multi-az deployment is a checkbox in most AWS services, e.g. ASGs, RDS, load balancers, etc. Someone didn't check that box because they didn't know about it; there isn't much complexity in it.
Aren't multi-az deployments more expensive? That would be a valid reason not to check this checkbox, if your business can survive a bit of downtime here and there.
Most of that expense is just the cost of a hot failover, but there is some additional cost around inter-AZ data transfer. If someone is not checking the boxes for cost reasons, I would be surprised if they had failovers in the same AZ. It seems more likely they just don't have failovers.
A checkbox that might 3-4x the cost.
Multi-AZ brings multi-complexity in terms of data duplication and consistency. If your app wasn't designed to handle those kinds of scenarios and experiences high user loads, then you are in for a lot of problems.

Designing for those scenarios increases complexity and cost and changes the architecture style, and most of the time it will bring you into microservices territory, where most companies lack experience and are just following best practices in a field where engineers are expensive and few.

RDS just has a button for multi-AZ primaries. No complexity or microservices.
OK, let's say you have a master in one AZ and it dies. What happens?
Automatic failover within 60 seconds.
OK, then what happened to the transactions that were in the middle of processing? The ones that were committed on only one side?