by jb_gericke
1133 days ago
Reminiscent of Adrian Cockcroft's “Gigalith” architecture. Not sure about the economy of scale though: if you need redundancy on a scale-up architecture you essentially need parity compute sitting idle (likely in another location/region/data centre). If you have an HA requirement your infra spend has just doubled, and the ROI on your DR tends to nothing (unless you can do something clever with on-demand DR)? You’re also replicating everything you write to DR (so network and storage there). Also, more practically, you’re just less flexible. Waiting 30 minutes for a machine to checksum 1TB of memory before loading the OS is painful in an outage scenario (every reboot hurts, replenishing cache hurts, losing a massive instance hurts). Do like the sentiment around simplicity though!
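To put rough numbers on the "spend has just doubled" point, here is a minimal sketch. It assumes cost scales linearly with node count and that every node in a fleet is the same size; the function name and the node counts are made up for illustration.

```python
def redundancy_overhead(nodes_needed: int, spares: int) -> float:
    """Extra capacity paid for, as a fraction of the capacity that serves load."""
    return spares / nodes_needed

# Scale-up: one big machine serves the load, one identical machine sits idle for DR.
scale_up = redundancy_overhead(nodes_needed=1, spares=1)

# Scale-out (hypothetical): ten small machines serve the load, one spare covers a failure.
scale_out = redundancy_overhead(nodes_needed=10, spares=1)

print(f"scale-up overhead:  {scale_up:.0%}")
print(f"scale-out overhead: {scale_out:.0%}")
```

Under those assumptions the single big box pays 100% extra for its standby, while the ten-node fleet pays 10% for one spare. It ignores replication traffic and the price premium (or discount) of very large instances, so treat it as an illustration, not a costing model.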
You're quite correct, but it's not such a simple trade-off.
1. What's a "disaster"? A) "Datacenter burns down" is very different to B) "Instance got overloaded with unexpected spike, need more capacity". For A), your distributed system will still have outages in a particular area, so you'll spin up new instances in the nearest live DC, at a time when everyone else is trying to do the same! And B) is not a disaster.
2. Disaster recovery is, by definition alone, an exceptional event. If you're reaching for DR protocols more than once a year, there's something seriously wrong. Being able to scale out means that you only need a few minutes before you're up to capacity, but whether shaving off (say) 30m during a genuine black swan event is worth it depends on the exact business. I've seen fintechs, FAANGs, Fortune 100 companies, banks, etc. have hours of downtime with no apparent negative effects. For them it was a black swan event, but not an extinction level event (ELE), nor even close.
The bigger problem for the business is this: if a black swan event that results in
> Waiting 30 minutes for a machine to checksum 1TB of memory before loading the OS is painful in an outage scenario
turns into an ELE, then it's the business that has problems, not the systems. A business that is so fragile to downtime is not going to be in business much longer anyway.
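For a sense of scale, a back-of-the-envelope sketch of what a once-a-year outage does to annual availability. The four-hour figure is an illustrative stand-in for "hours of downtime", not a number from any particular incident.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year

def availability(downtime_minutes_per_year: float) -> float:
    """Fraction of the year the service is up, given total downtime in minutes."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR

a30 = availability(30)       # one 30-minute reboot/checksum per year
a4h = availability(4 * 60)   # one four-hour outage per year (illustrative)

print(f"30 min/year: {a30:.4%}")
print(f"4 h/year:    {a4h:.4%}")
```

A single 30-minute event per year still leaves you at roughly 99.99% availability, and four hours at about 99.95%. Whether that gap matters is a business question, not a systems one.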