Bloomberg does it once a year. That said, I have yet to encounter a company more obsessed with business continuity. I don't doubt that their failover systems and testing of them are well beyond the typical.
I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.
Netflix designed their stuff from the ground up to fail over. Large monolith corporations who've inherited systems from other companies they've bought or merged with have challenges you won't see many places that have benefited from the 30 years of lessons that were taught at these companies.
> I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.
No, it can't. Any loss of customer-facing functionality is a critical outage ("World Problem" in company terminology). There are a relatively small number of customers, but the terminal is critical to the operations of those who buy it. The terminal going down for eight hours would be a world-wide headline in the financial press.
A Tier 1 test that simulates loss of a datacenter takes a cluster one DC virtually offline. This puts an entire subset of services offline in that DC entirely. The test is coordinated with the teams who own the services to ensure their services fail over correctly. Any service disruption during the failover is a test failure. If it passes, the customers don't even know it happened. The goal is to be able to lose an entire DC and have the terminal customers not realize it until they hear about it on the news.
Chaos engineering and AWS weren't real things when they started building the company. And the system they have now doesn't resemble much of it was once.
Truth of the matter is they invested more in their infrastructure, but that's because their business plan required them to grow on the back of technological advances. Banks, it's seems, do not. Or maybe they do, and the some of these start up banks will usurp them.
Standard good practice should be to have a redundancy in place and test it at a regular interval. It should be part of periodic maintenance - fail to the backup so updates/grades can be applied to production and fail back to production once done.
But I’m guessing wellsfargo just doesn’t have a reason to care.
Netflix designed their stuff from the ground up to fail over. Large monolith corporations who've inherited systems from other companies they've bought or merged with have challenges you won't see many places that have benefited from the 30 years of lessons that were taught at these companies.