Hacker News
by skewart 3799 days ago
Am I the only one who is a little shocked that a power outage could have such a huge effect and bring them down for so long? I'm not an infrastructure guy, and I don't know anything about Github's systems, but aren't data center power outages pretty much exactly the kind of thing you plan for with multi-region failover and whatnot? Is it actually frighteningly easy for this kind of thing to happen despite following best practices? Or is it more likely that there's more to the story than what they're sharing now?
11 comments

I am not at all surprised. There are 'best practices' and then there is what really happens based on business processes and needs. In reality, even the most cloudy of cloud providers will run into this problem at some point. Folks often come up with ideas of implementing something like Chaos Monkey in their data center, then realize the actual impact it will have and find it is almost impossible to get the rest of the business to agree to this concept. It isn't as easy as it sounds. I only know of two businesses that have actually implemented Chaos Monkey, one being the company that coined the term. Even regular reboots won't catch these problems, and if folks were honest, you would see uptimes of more than a year on most servers in most places. That is just based on my experiences and I am sure some of you have seen different.
The problem is that most environments are very heterogeneous. I evaluated a Chaos Monkey approach for a big bank. The issue is that Netflix has whole data centres full of machines doing pretty much the same thing: streaming and serving.

And the worst that can happen is a customer's stream stops and they have to restart it.

But in most big companies you have thousands of apps that are all doing very different things. Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeah, in theory it should be able to cope, but in reality the scales in most orgs are quite different.

That said github sounds a lot more like the netflix end of the scale, doing one specific thing at large scale.

While Netflix as a company is focused on doing one specific thing at large scale, they're heavily invested in microservices and do actually have "thousands of apps that are all doing very different things".

Chaos Monkey fits when people build and deploy their services with the notion that any particular instance (or dependency) could fail at any given time. It's a tough road to evolve out of a legacy, monolithic stack without much redundancy baked in.
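
To make the idea concrete, here is a minimal sketch of one "chaos round" against a simulated pool of instances. It is not Netflix's tool or its API: the function names, the instance names, and the `min_healthy` threshold are all hypothetical, and "killing" an instance here just removes a string from a set.

```python
import random


def kill_random_instance(instances, rng):
    """Remove and return one randomly chosen instance from the set.

    This only simulates a termination; no real host is touched.
    """
    victim = rng.choice(sorted(instances))  # sorted() makes seeded runs repeatable
    instances.remove(victim)
    return victim


def run_chaos_round(instances, min_healthy, rng=None):
    """Kill one random instance and report whether the service still has
    at least `min_healthy` instances left (a hypothetical survival rule)."""
    if rng is None:
        rng = random.Random()
    remaining = set(instances)  # work on a copy, leave the caller's pool intact
    victim = kill_random_instance(remaining, rng)
    survived = len(remaining) >= min_healthy
    return victim, survived


if __name__ == "__main__":
    pool = {"web-1", "web-2", "web-3", "web-4"}  # hypothetical instance names
    victim, survived = run_chaos_round(pool, min_healthy=3)
    print(f"killed {victim}, service survived: {survived}")
```

A service that needs every one of its instances to stay up fails this check on the first round, which is the kind of design the tool is meant to flush out.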

Whether they have broken up their apps into microservices doesn't seem to matter to me. That's just a matter of how they have organised their code; whether the actual app is monolithic or microlithic is beside the point.

They have a focussed business with relatively little variation in how they make money - all their customers simply pay for a streaming service.

Most large companies, certainly banks anyway, have thousands of apps because there are also thousands of different parts of the business making money in their own unique ways that have their own unique needs.

What works for netflix therefore can't work for other businesses, because the actual business is much more heterogeneous than that of netflix and the technology will reflect that whether it is organised in microservices or monolithically - that's totally irrelevant.

> Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeah, in theory it should be able to cope, but in reality the scales in most orgs are quite different.

The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist.

What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down.

That's true, but if an app, say, is running on 4 hosts doing some boutique thing for a small unit of 20 traders, then the practical reality is that they might not want Chaos Monkey bringing down 25% of the throughput randomly, and interrupting whatever actual cash money requests are in progress on a host.

It's a lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny tiny proportion of your customers have to restart their video stream. So what?

But for a small heterogeneous business where what's happening has a much higher cash density, the practicalities of randomly killing things in production, and the risk that represents, rather get in the way. Even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.
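
The arithmetic behind that contrast is easy to sketch. The 4-host figure comes from the example above; the 2000-host fleet is a made-up number standing in for "thousands of servers", and the function name is hypothetical.

```python
def capacity_lost(total_hosts, hosts_killed=1):
    """Fraction of capacity lost when `hosts_killed` of `total_hosts`
    identical hosts go down (assumes load is spread evenly)."""
    if total_hosts <= 0 or not 0 <= hosts_killed <= total_hosts:
        raise ValueError("need 0 <= hosts_killed <= total_hosts and total_hosts > 0")
    return hosts_killed / total_hosts


if __name__ == "__main__":
    # 4 hosts: the boutique trading app from the comment above.
    print(f"4 hosts:    {capacity_lost(4):.2%}")
    # 2000 hosts: hypothetical size for a large, homogeneous fleet.
    print(f"2000 hosts: {capacity_lost(2000):.2%}")
```

Killing one host costs a quarter of the small app's capacity but a twentieth of a percent of the large fleet's, before even counting what each in-flight request is worth.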

I think you're missing the point. The point of something like chaos monkey is to force you to build a system that won't lose money by "bringing down 25% of the throughput".
My point is that no matter how well engineered your system is, actually having Chaos Monkey running in production really depends on the risk profile and scale of your business.

As soon as Chaos Monkey causes a service interruption for, say, traders, it would get turned off and whoever had such a bright idea fired. But if it causes a service interruption for a tiny proportion of people watching streaming videos, no big deal.

Its proponents just ignore this practical reality and seem politically unaware.

> In reality, even the most cloudy of cloud providers will run into this problem at some point.

Actually, wasn't this[0] what did happen several years ago when Amazon Ireland went down for days on end?[1]

[0] TL;DR: Cascading effects of power outage.

[1] http://readwrite.com/2011/08/08/amazons-ireland-services-sti... (didn't read the article, it was just high in the google search results)

Interesting. But if, let's say, a data center in London where they have a lot of boxes goes down completely, then they spin up boxes in Frankfurt and Riga to take up the load and reroute traffic. Service is disrupted for some customers for a few minutes. Some people lose some stuff completely because replication wasn't happening perfectly. But the entire site doesn't go down for everyone for two hours.

Are those kinds of failover scenarios frequently messy and risky at the scale of Github? Or is it more likely that in the context of a fast growing company, and even at a place as "cloudy" as Github, there are bound to be some serious bugs lurking in your system design?

I've experienced a brief full-scale power loss at a data center before. It is unbelievable how much goes wrong. The machines had been chugging along for years, happily doing their job, but on the next boot the hard drives were suddenly corrupted, or the power supplies broken. The impacts of that power outage were felt for at least six months.

It's one of those things where, if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening. So when it does, it's not pretty. :)

> if you're not regularly cutting power to your data center, you're not building resilience to such a thing happening

Would love to read examples of who is doing this and how. Reminds me of Netflix's Chaos Monkey, only applied to electricity. :p

There's a mention of Facebook regularly doing this in the summary section of this instagram engineering post: http://engineering.instagram.com/posts/548723638608102/

EDIT: Here's more info: http://www.datacenterknowledge.com/archives/2014/09/15/faceb...

Awesome, thank you. :)
I remember reading a few years back that Yahoo once a week takes a random data center offline, just to make sure they could do that without issues. They probably didn't actually cut the power ;) But they used it as an argument against investing too much in emergency generators and such: they'll fail or cause accidents and you need the ability to fail-over either way, so make it routine.
I think trying to cut power at least once is better if it's possible. The reason is that digital is just an abstraction over analog, electrical activity. Plus there's actual analog in there doing work, too. So, seeing how all the chips in there respond to an actual and instantaneous drop of the power would be an interesting test of the models they're built against.

Like an above commenter mentioned, weird activity in the electrical system can make some products go haywire and even corrupt data in unexpected ways. Of course, simulated takedowns and all appropriate measures for countering common issues should've already happened before a real one. Just to be extra clear there.

Google wrote an article about disaster recovery in 2012. https://queue.acm.org/detail.cfm?id=2371516
What data center was it?

I can't remember the last time there was a power outage at a Tier I or II data center -- they're all N+1, from the cabinet PDUs to the distribution units to the UPSes to the diesel generators. Some even go so far as to connect to multiple in-feeds from different utility providers.

At my company, every piece of server, storage, and network equipment we own is connected via redundant power supplies to different circuits (except for nonessential equipment like monitors; we can simply re-plug them into the functioning circuit). I can't imagine running a datacenter any other way.
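
A rule like that is easy to audit mechanically. Here is a rough sketch, assuming you can export a mapping from each device to the circuits its power supplies are plugged into; the device names, circuit labels, and function name are all made up for illustration.

```python
def single_circuit_devices(power_map, exempt=frozenset()):
    """Return the devices whose power supplies do not span at least two
    distinct circuits, skipping anything listed in `exempt`.

    power_map: dict of device name -> list of circuit labels, one per PSU.
    """
    return sorted(
        device
        for device, circuits in power_map.items()
        if device not in exempt and len(set(circuits)) < 2
    )


if __name__ == "__main__":
    # Hypothetical inventory; "A" and "B" stand for two independent circuits.
    inventory = {
        "db-1": ["A", "B"],      # properly dual-fed
        "switch-1": ["A", "A"],  # two PSUs, but both on the same circuit
        "monitor-1": ["A"],      # nonessential, allowed to be single-fed
    }
    print(single_circuit_devices(inventory, exempt={"monitor-1"}))
```

The check counts distinct circuits rather than power supplies, so a box with two PSUs on the same circuit is still flagged.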

I have no doubt the people at Github have spent a lot of time thinking about multi-region failover. You never hear about the successful failovers --- only the ones which cause outages. To quote a famous US politician: "There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know."

You can't failover things you didn't predict.

Except you can predict it. Your fail-over mechanism needs to be able to detect these things:

1. Degraded performance that might be a fault justifying fail-over. A human in the loop is a must here as complex services can just act weird under load or randomly.

2. Corrupted data or packets coming in that might indicate a failure. Might automatically fail-over here.

3. No data coming in at all for 5-10 seconds, esp on a dedicated line. Fail-over automatically here as nothing sending data is already the definition of downtime and probably indicates a huge failure.

Companies should also do plenty of practice fail-overs at various layers of the stack during non-critical hours to ensure the mechanisms work. In Github's case, number 3 should've applied, and solutions from as far back as the '80s would kick in automatically within seconds to minutes. Their tech or DR setup must just not be capable of that. There could be good financial reasons or something for that, but not technological ones.
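
The three conditions above could be sketched as a small decision function. This is only an illustration: the 5-second silence limit is the low end of the range in point 3, while the corruption and latency thresholds, the names, and the choice to fail over automatically on corruption (point 2 only says it "might") are assumptions.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAGE_HUMAN = "page_human"  # point 1: a human decides
    FAILOVER = "failover"      # points 2 and 3: act automatically


def decide(seconds_since_last_data, corrupt_ratio, latency_ms,
           silence_limit_s=5.0, corrupt_limit=0.01, latency_limit_ms=500.0):
    """Map simple health signals to a fail-over action.

    corrupt_limit and latency_limit_ms are illustrative defaults,
    not values taken from the discussion.
    """
    # Point 3: no data at all is already downtime, so fail over.
    if seconds_since_last_data >= silence_limit_s:
        return Action.FAILOVER
    # Point 2: corrupted data or packets; assumed to trigger automatically here.
    if corrupt_ratio > corrupt_limit:
        return Action.FAILOVER
    # Point 1: degraded performance is ambiguous, so keep a human in the loop.
    if latency_ms > latency_limit_ms:
        return Action.PAGE_HUMAN
    return Action.NONE


if __name__ == "__main__":
    print(decide(seconds_since_last_data=7.0, corrupt_ratio=0.0, latency_ms=20.0))
    print(decide(seconds_since_last_data=0.2, corrupt_ratio=0.0, latency_ms=900.0))
```

Silence is checked first because it is the least ambiguous signal; slow responses only page a person, since complex services can act weird under load without being down.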

Heh, that quote always amuses me. People hated it, but it actually does make a lot of sense.
My experience matches exactly what Github says. Power outages can bring down even the best systems. The problem is that it is never clear what parts of the systems will continue to work in these situations, until it actually happens. Especially when you're talking about complex applications that depend on many moving pieces. The point is, the more complex your online app, the more points of failure can be exposed in these situations.
We've been mitigating against this kind of thing with backups at other datacenters or colos for a while. They can be hot standby, cold standby, slightly degraded in performance, whatever. I also recommend the backup be on a different part of the overall power grid in case it cascades in failure. The good colos often have connections to multiple backbones, too, which is extra redundancy.

That all assumes there's a total and catastrophic failure at the main datacenter. If not, there are local backup batteries to sustain a smoother fail-over plus shutdown. Plus, there are tricks like isolating the monitoring systems from the main systems and power supply using things like data diodes over optocouplers or infrared. At least one thing will still be working and feeding you reliable information over a wireless connection after the full failure.

NonStop and VMS setups from late 80's did better than Github. My own setups involving a minimum of servers plus apps with loose coupling could fail-over in such a situation. So, this just has to be bad architecture caused by who knows what. Examples below of OpenVMS in catastrophic situations having either no downtime or short downtime due to good architecture plus disaster planning.

Case study of active-active at World Trade Center http://h71000.www7.hp.com/openvms/brochures/commerzbank/comm...

Marketing piece where HP straight-up detonates a datacenter. Guess who was number 1 in recovery. :) https://youtu.be/bUwthF9x210?t=34s

I doubt it matters to anybody but was it really necessary to kill the fish?
Watch it until the end :)
Haha nice catch. I missed it originally thinking it would just be more marketing crap. So, they probably just moved it before detonating. Not sadistic bastards after all.
Watch 'The Prestige'.
I know... My original reply mentioned two scenarios with one having replacement fish. Then, I thought people would think I'm overly paranoid or negative. I just couldn't help wonder if they'd blow the fish for fun then avoid liability with similar looking ones. Then, I edited the comment for sake of presumption of innocence.

But, yeah, I hear you... Great movie as well. One of few that brings my favorite mad scientist into eye of mainstream audience as well. I doubt I must name him. :)

EDIT to add: I'm guessing you think the geeks were too sadistic to pass up the opportunity, eh?

I did, but unless the whole video was a fake it doesn't really matter does it? And if it is then that does not reflect well on HP either...
I didn't think about the image angle. Yeah, you'd think a marketing person would be like, "Wait, this could lead to a PETA lawsuit and lower sales. Not to mention our segment that likes fish."
Dude, I was thinking the same thing! That was seriously f*ed up. They should've left some cool fireworks or something left-over from July 4th. Or some safe-ish chemical that would make colorful smoke. All kinds of tricks you can do without killing live animals.

I mean, I've heard about things so wrong and easy it's like shooting fish in a bucket but... exploding fish in a datacenter? That's on another level.

> HP straight-up detonates a datacenter

Apparently 5 server racks in the middle of an open field is a "datacenter".

Nah, a collection of computers with high-availability setup communicating with another collection constantly over a dedicated or high-bandwidth line. In the demo, it was 5 racks in an open field. In the bank study, it was a whole bank's worth of computers in two locations. For some organizations, it's 5+ of them just to be sure.

The common trend is that the systems constantly sync critical data, can detect downtime, and automatically (or manually) fail-over when it occurs. There have been OSes and ISVs offering that capability, many proven in the field, going back decades. Certain high-tech companies just don't apply those for whatever reasons. Maybe their stacks just still don't have that feature.

I am surprised at the data center. Power failure is one of the most basic parts of being N+1 for a data center. That is why they have batteries (last a few minutes) and then diesel generators (last days if needed).
Stuff happens, and even if you test all kinds of things, real failure situations can always work differently, with partial failures etc. It just takes one important subsystem hitting an unforeseen edge case, and going completely down is in many cases better than risking running in a state that destroys data or does other bad things. Same for taking your time to go back online.

The cases that work are not the ones you hear about. Best practices and testing reduce the risk of making the news, but can't guarantee success.

The only way you can build fault-resilient systems is to frequently test fault injection scenarios. Netflix is pretty mature in this regard, perhaps Github can learn from their example.

That said, it's possible that github may have considered that this particular style of outage is rare enough that they don't want to make their design tolerate it. Though if that were the case, I'd wager they'd re-evaluate the cost/benefit right around now. :)

Gotta love it when a top comment starts with "Am I the only one".
It is a bit surprising actually. It means they haven't built their app to be tolerant of single DC loss, either on purpose or because they didn't test it properly.

Purely conjecture, but I suspect since github uses mysql cluster they only write to a single dc, which would be the primary dc that failed in this case.

I'm shocked as well. You would think they would deploy in multiple availability zones at the very least.