Hacker News new | ask | show | jobs
by hinkley 2691 days ago
I can't recall who it was, but there was a big outage due to a data center running a monthly generator test and having a major customer go dark. The data center had 7 generators for 4 server rooms: 1 per room, a backup for each pair of rooms, and a backup for the backups. The primary and then both backups failed, so out went the lights.

You're pretty much damned if you do and damned if you don't. If you touch things that are working you could break them. If you don't touch things you never know what'll happen and you get fewer opportunities to learn. Move your servers around geographically and you might improve the odds that anything is working by reducing the odds that everything is working.

I don't think we're quite to a place yet where having servers down can be characterized as a non-event. Even if the customer can't see a behavioral difference, business units still tend to get quite anxious, and sometimes their theatrics put the whole process in jeopardy (not unlike trying to rescue a drowning man). It just hasn't been normalized yet.

1 comments

Look at Netflix with chaos monkey and simian army. Netflix routinely does catastrophic destructive testing on their production systems, sometimes evacuating an AWS region entirely. To them, a server going down is a non-event, because they designed their systems with the premise that servers go down.
Netflix is a great example of how to create robust systems. Keep in mind though, they have a very different risk profile than a large bank. no one is going to lose their life’s savings if netflix entire infrastructure crumbles. This might happen in a bank with a single server outtage. Don’t get me wrong, if I start a bank today, I use chaos monkey stategy. But if I take over a bank infrastructure, with cobol still representing a huge percentage of mission critical code that everyone is scared to touch... no chaos monkey. I might deliberately turn off a server for an hour to see what chaos ensues, but it’ll be after 3 months of analysis, longer if I begin to suspect the system does more than anyone can remember.
Not to mention adhering to the Fed's directives on tech-readiness, on which the Bank's license hinges on.
If a customer can't watch a film they get unhappy.

If a customer can't pay a fine - can't use their bank account - they go to jail. https://www.telegraph.co.uk/finance/personalfinance/bank-acc...

These are pretty different outcomes.