by truthseeker11
2567 days ago
The outage lasted two days for our domain (edu, sw region). I understand that they are reporting a single day with 3-4 hours of serious issues, but that’s not what we experienced. Great write-up otherwise; glad they are sharing openly.
|
Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.
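A rough sketch of why pooling shrinks the margin, under assumptions that are mine and not from the outage report: every workload is independent, with the same mean and standard deviation of demand, and the operator provisions `k` standard deviations above the pooled mean. The function name and the default numbers are made up for illustration.

```python
import math


def relative_safety_margin(n_workloads, mean=100.0, stddev=30.0, k=3.0):
    """Safety margin as a fraction of the pooled mean demand.

    Assumes n independent workloads with identical mean and stddev
    (hypothetical numbers), and a margin of k pooled standard deviations.
    """
    pooled_mean = n_workloads * mean
    # Variances of independent workloads add, so the pooled standard
    # deviation grows with sqrt(n) while the pooled mean grows with n.
    pooled_stddev = math.sqrt(n_workloads) * stddev
    return k * pooled_stddev / pooled_mean


for n in (1, 10, 100, 1000):
    print(n, round(relative_safety_margin(n), 3))
```

Under those assumptions the margin, as a share of normal demand, falls off like 1/sqrt(n). Correlated workloads (everyone retrying at once, say) break the independence assumption, and the benefit with it.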
These systems will have some capacity to onboard new workloads; let us call it X. They carry the sum of all onboarded workloads; let us call that Y. Then there is the demand for the services of Y; call that Z.
As you may imagine, Y is bigger than X, by a lot. And when X falls, the capacity to handle Z falls behind.
So in a disaster recovery scenario, you start with:
* the same demand, possibly increased from retry logic & people mashing F5, of Z
* zero available capacity, Y, and
* only X capacity-increase-throughput.
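Those starting conditions can be turned into a toy model. This is only a sketch: it assumes capacity comes back linearly at X units per time step, demand stays fixed, and retries are a flat multiplier on Z. The function name, the `retry_factor` parameter, and the numbers are hypothetical, not anything from the incident report.

```python
def recovery_steps(demand_z, onboard_rate_x, retry_factor=1.0):
    """Count time steps until restored capacity covers demand.

    Starts from zero available capacity (Y = 0) and restores
    onboard_rate_x units per step. retry_factor inflates demand to
    stand in for retry logic and people mashing F5 (an assumption).
    """
    if onboard_rate_x <= 0:
        raise ValueError("onboard_rate_x must be positive")
    effective_demand = demand_z * retry_factor
    capacity_y = 0.0
    steps = 0
    while capacity_y < effective_demand:
        capacity_y += onboard_rate_x
        steps += 1
    return steps


# Demand is 100 times the onboarding throughput.
print(recovery_steps(demand_z=1000, onboard_rate_x=10))
# Same system, with retries inflating demand by half.
print(recovery_steps(demand_z=1000, onboard_rate_x=10, retry_factor=1.5))
```

The only way to get the step count near 1 in this model is to make X about as large as Z, which means paying for onboarding throughput that sits idle almost all the time.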
As it recovers, you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured, and so on.
Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.