by surge 2693 days ago
They had redundancy, the redundancy failed.

I mean, truthfully, when do you get to test your redundancy against a true disaster? It was a mess. WF is 20 companies rolled into one, so the fact that the disparate systems from 10 different banks work at all is kind of a miracle.

5 comments

I can't recall who it was, but there was a big outage due to a data center running a monthly generator test and having a major customer go dark. The data center had 7 generators for 4 server rooms: 1 per room, a backup for each pair of rooms, and a backup for the backups. The primary and then both backups failed, so out went the lights.
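To make the layout described above concrete, here is a minimal sketch of the 7-generator arrangement. The generator names (P1..P4, B12, B34, BB) are made up for illustration, and the model assumes any single working generator in a room's chain can carry that room, which ignores capacity limits.

```python
# Hypothetical model: 4 rooms, 7 generators (names are illustrative only).
# Each room's chain is: its primary, its pair backup, the backup for the backups.
ROOMS = {
    "room1": ["P1", "B12", "BB"],
    "room2": ["P2", "B12", "BB"],
    "room3": ["P3", "B34", "BB"],
    "room4": ["P4", "B34", "BB"],
}


def dark_rooms(failed):
    """Return the rooms whose entire generator chain has failed."""
    failed = set(failed)
    return sorted(
        room for room, chain in ROOMS.items()
        if all(gen in failed for gen in chain)
    )


# One primary failing is survivable; primary plus both backups is not.
print(dark_rooms({"P1"}))
print(dark_rooms({"P1", "B12", "BB"}))
```

Under these assumptions it takes three specific failures to darken a room, which matches the story: the primary and then both backups.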

You're pretty much damned if you do and damned if you don't. If you touch things that are working you could break them. If you don't touch things you never know what'll happen and you get fewer opportunities to learn. Move your servers around geographically and you might improve the odds that anything is working by reducing the odds that everything is working.
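The trade-off in that last sentence can be put in numbers. This is a rough sketch that assumes each site fails independently and is up with the same probability p; both are simplifying assumptions, and the 0.99 figure is invented for the example.

```python
def all_up(p, n):
    """Probability that every one of n independent sites is up."""
    return p ** n


def any_up(p, n):
    """Probability that at least one of n independent sites is up."""
    return 1 - (1 - p) ** n


# Spreading across more sites: "everything works" gets less likely,
# "something works" gets more likely.
for n in (1, 2, 3):
    print(n, round(all_up(0.99, n), 6), round(any_up(0.99, n), 6))
```

Real sites share failure modes (power grid, network, bad deploys), so the independence assumption overstates the benefit.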

I don't think we're quite to a place yet where having servers down can be characterized as a non-event. Even if the customer can't see a behavioral difference, business units still tend to get quite anxious, and sometimes their theatrics put the whole process in jeopardy (not unlike trying to rescue a drowning man). It just hasn't been normalized yet.

Look at Netflix with Chaos Monkey and the Simian Army. Netflix routinely does catastrophic destructive testing on their production systems, sometimes evacuating an AWS region entirely. To them, a server going down is a non-event, because they designed their systems with the premise that servers go down.
Netflix is a great example of how to create robust systems. Keep in mind, though, that they have a very different risk profile than a large bank. No one is going to lose their life’s savings if Netflix’s entire infrastructure crumbles. This might happen in a bank with a single server outage. Don’t get me wrong, if I start a bank today, I use a Chaos Monkey strategy. But if I take over a bank infrastructure, with COBOL still representing a huge percentage of mission-critical code that everyone is scared to touch... no Chaos Monkey. I might deliberately turn off a server for an hour to see what chaos ensues, but it’ll be after 3 months of analysis, longer if I begin to suspect the system does more than anyone can remember.
Not to mention adhering to the Fed's directives on tech-readiness, on which the bank's license hinges.
If a customer can't watch a film they get unhappy.

If a customer can't pay a fine - can't use their bank account - they go to jail. https://www.telegraph.co.uk/finance/personalfinance/bank-acc...

These are pretty different outcomes.

Facebook regularly takes down multiple data-centers at a time to stress test this. Its users rarely notice (which I think is the point).
There is a different level of criticality for "my post didn't go through, hit refresh" and "my transaction didn't go through - the restaurant said my card isn't working."

Would you honestly want to go to a bank and say "if we unplug this, we can find out what fails"?

Facebook handles money too, though. Also I think the parent was making the point that Facebook builds software with resiliency in mind so when a failure does happen, the software deals with it gracefully.
They can have (and did have, at least the long time ago when I still used it) weird cache persistency errors and "please just refresh to fix that" type of workflows if you have bad luck. That sort of behaviour is simply not acceptable for a bank.
Are you talking about Messenger? That was a front-end issue, and they created React/Flux to fix that.
They're <10 years old. Their DCs better be cleaner than my kitchen.

I've worked for banks here in Australia. Everything is 30+ years old. It's a shambles.

There is this company that had a grid power issue: the batteries kicked in, the batteries ran low, it was show time for the diesel generator, and the diesel generator didn’t kick in. People scrambled to shut down before the batteries ran out. So the ITIL stuff happened, and when it was time to test it, guess what? The diesel again didn’t kick in (I don’t recall the consequences, though).
Please expand ITIL, I don't know what it means.
https://en.m.wikipedia.org/wiki/ITIL Article explains it all
Then you’re an ITIL expert! Welcome to mid level IT management.

I’m serious.

Bloomberg does it once a year. That said, I have yet to encounter a company more obsessed with business continuity. I don't doubt that their failover systems and testing of them are well beyond the typical.
I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

Netflix designed their stuff from the ground up to fail over. Large monolithic corporations that have inherited systems from other companies they've bought or merged with have challenges you won't see at many places that have benefited from the 30 years of lessons that were taught at these companies.

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster. Banks with legacy systems and people constantly dependent on them to conduct business can't risk an actual incident happening because they were testing what would happen if an incident happened.

No, it can't. Any loss of customer-facing functionality is a critical outage ("World Problem" in company terminology). There are a relatively small number of customers, but the terminal is critical to the operations of those who buy it. The terminal going down for eight hours would be a world-wide headline in the financial press.

A Tier 1 test that simulates loss of a datacenter takes a cluster in one DC virtually offline. This puts an entire subset of services in that DC offline. The test is coordinated with the teams who own the services to ensure their services fail over correctly. Any service disruption during the failover is a test failure. If it passes, the customers don't even know it happened. The goal is to be able to lose an entire DC and have the terminal customers not realize it until they hear about it on the news.
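The pass/fail rule described above (any disruption is a failure) can be sketched in a few lines. This is not the company's actual tooling; the service names, DC names, and the `disrupted_services` helper are all hypothetical, and a service counts as disrupted here simply when it has no replica outside the offline DC.

```python
def disrupted_services(services, offline_dc):
    """Return the services left with no replica once offline_dc is dark."""
    return sorted(
        name for name, dcs in services.items()
        if not (set(dcs) - {offline_dc})
    )


# Hypothetical inventory: which DCs host each service.
services = {
    "quotes": ["dc-a", "dc-b"],
    "chat": ["dc-a", "dc-b"],
    "reports": ["dc-a"],  # single-homed, so it cannot fail over
}

disrupted = disrupted_services(services, "dc-a")
# The test passes only if nothing at all was disrupted.
print("PASS" if not disrupted else f"FAIL: {disrupted}")
```

A real test would also have to watch latency, in-flight requests, and data consistency during the switch, none of which this sketch models.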

> I think Bloomberg can stand to be down for 8 hours to simulate a disaster.

Do you know what Bloomberg does? It powers equities trading markets around the world, 24/7. It isn't just news.

Well that's not true.

Chaos engineering and AWS weren't real things when they started building the company. And the system they have now doesn't much resemble what it once was.

Truth of the matter is they invested more in their infrastructure, but that's because their business plan required them to grow on the back of technological advances. Banks, it seems, do not. Or maybe they do, and some of these startup banks will usurp them.

Standard good practice should be to have redundancy in place and test it at a regular interval. It should be part of periodic maintenance: fail over to the backup so updates/upgrades can be applied to production, and fail back to production once done.

But I’m guessing Wells Fargo just doesn’t have a reason to care.

Business critical systems can’t afford not to test failover.

You can bail out of a test at the first sign of trouble. When a real outage hits, there’s no telling how long it will take to recover.

We run that drill every Wednesday morning.