> If you have an HA requirement your infra spend has just doubled, and the ROI on your DR tends to nothing (unless you can do something clever with on demand DR)? You're also replicating everything you write to DR (so network and storage there).

You're quite correct, but it's not such a simple trade-off.

1. What's a "disaster"? A) "Datacenter burns down" is very different from B) "Instance got overloaded by an unexpected spike, need more capacity". For A), your distributed system will still have outages in a particular area, so you'll be spinning up new instances in the nearest live DC at a time when everyone else is trying to do the same! And B) is not a disaster.

2. Disaster recovery is, by definition, an exceptional event. If you're reaching for DR protocols more than once a year, there's something seriously wrong. Being able to scale out means that you only need a few minutes before you're up to capacity, but whether shaving off (say) 30m in a genuine black swan event is worth the cost depends on the exact business. I've seen fintechs, FAANGs, Fortune 100 companies, banks, etc. have hours of downtime with no apparent negative effects. It's a black swan event, but it's not an extinction-level event (ELE), nor even close.

A bigger problem for the business is that if a black swan event that results in:

> Waiting 30 minutes for a machine to checksum 1TB of memory before loading the OS is painful in an outage scenario

turns into an ELE, then it's the business that has problems, not the systems. A business that is so fragile to downtime is not going to be in business much longer anyway.