by nemothekid 1429 days ago
If you already have 4 application servers you are probably already AZ tolerant; most people concerned about "doubling everything" are only running 1 instance.

Going by your example: if your website requires 1 application server, then tolerating a single AZ failure requires you to double the number of application servers.
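
To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes servers are spread evenly across AZs and that the surviving AZs must carry the full load after one AZ is lost; the function names are made up for illustration.

```python
import math

def servers_for_az_tolerance(required: int, azs: int) -> int:
    """Total servers needed so that losing any one AZ still leaves
    `required` servers running (assumes an even spread over all AZs)."""
    if required < 1 or azs < 2:
        raise ValueError("need at least 1 server and at least 2 AZs")
    # The remaining (azs - 1) zones must hold the full required capacity.
    per_az = math.ceil(required / (azs - 1))
    return per_az * azs

def overhead(required: int, azs: int) -> float:
    """Cost multiplier relative to running with no AZ tolerance."""
    return servers_for_az_tolerance(required, azs) / required

print(overhead(1, 2))  # 2.0: one server becomes two, i.e. "doubling everything"
print(overhead(4, 3))  # 1.5: four servers become six across three AZs
```

Under these assumptions the overhead shrinks as the fleet grows and spreads over more AZs, which is why the single-server case hurts the most.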

Example: we have a service whose Kafka instance was in the affected region and went down. Our primary Kafka instance (R=3) survived, but this auxiliary one failed and caused downtime. There's no way around this other than doubling the cost.

2 comments

In most cases the elephant* in the room is your DB. It doesn't matter where your stateless application servers are: if your stateful DB goes down, you're in trouble. It's also often 1) the hardest to replicate, as replication involves tradeoffs (see the CAP theorem & co.), and 2) the most expensive, since it needs to be pretty beefy in terms of CPU, RAM and IO, all of which are very expensive on AWS.

*: https://commons.wikimedia.org/wiki/File:Postgresql_elephant....

That's true: when you're only dealing with 1 server, you technically double the cost by adding a second server. My original comment was about "popular sites/services", which should be able to tolerate the costs and are most likely dealing with multiple servers.

For a single-server deployment you can still reduce your downtime (at minimal cost) by having the ASG redeploy into another AZ on a failed health check.
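
A rough sketch of what that could look like, written as the keyword arguments for boto3's `create_auto_scaling_group`. The group name, launch template name, and subnet IDs are placeholders, and the helper function is hypothetical. The important parts are a fixed size of 1 and subnets in more than one AZ, so a replacement instance can be launched in a different AZ.

```python
def single_instance_asg_params(name, launch_template, subnet_ids):
    """Build kwargs for boto3's autoscaling create_auto_scaling_group call.

    All names passed in are placeholders. The subnets are assumed to
    live in different AZs.
    """
    if len(subnet_ids) < 2:
        raise ValueError("need subnets in at least 2 AZs to fail over")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        # Exactly one instance at a time, so steady-state cost stays the same.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Comma-separated subnet IDs; the ASG may launch in any of them.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # "EC2" uses instance status checks; "ELB" would additionally
        # require the group to be attached to a load balancer.
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,
    }

params = single_instance_asg_params(
    "web-single", "web-template", ["subnet-aaaa", "subnet-bbbb"]
)
# With AWS credentials configured, this would be applied with:
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

You still take downtime while the replacement instance boots, and this does nothing for state stored on the instance, so it's a cheap mitigation rather than real AZ tolerance.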