| HN Mirror



	by LinuxBender 3799 days ago
	I am not at all surprised. There are 'best practices' and then there is what really happens based on business processes and needs. In reality, even the most cloudy of cloud providers will run into this problem at some point. Folks often come up with ideas of implementing something like Chaos Monkey in their data center, then realize the actual impact it will have and find it is almost impossible to get the rest of the business to agree to this concept. It isn't as easy as it sounds. I only know of two businesses that have actually implemented Chaos Monkey, one being the company that coined the term. Even regular reboots won't catch these problems, and if folks were honest, you would see uptimes of over a year on most servers in most places. That is just based on my experiences and I am sure some of you have seen different.

3 comments

JetSetWilly 3799 days ago

The problem is most environments are very heterogeneous. I evaluated the Chaos Monkey approach for a big bank; the issue is that Netflix has whole data centres full of machines doing pretty much the same thing: streaming and serving.

And the worst that can happen is a customer's stream stops and they have to restart it.

But in most big companies you have thousands of apps that are all doing very different things. Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeh in theory it should be able to cope but in reality the scales in most orgs are quite different.

That said, GitHub sounds a lot more like the Netflix end of the scale, doing one specific thing at large scale.


drather19 3799 days ago

While Netflix as a company is focused on doing one specific thing at large scale, they're heavily invested in microservices and do actually have "thousands of apps that are all doing very different things".

Chaos Monkey fits when people build and deploy their services with the notion that any particular instance (or dependency) could fail at any given time. It's a tough road to evolve out of a legacy, monolithic stack without much redundancy baked in.
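To make the "any particular instance could fail at any given time" idea concrete, here is a minimal sketch of a random terminator with a safety floor. It is not Netflix's implementation; `pick_victim`, `min_healthy`, and the instance names are all hypothetical, and actually terminating the chosen instance is left out.

```python
import random


def pick_victim(instances, min_healthy, rng=random):
    """Pick one instance to kill at random, or None if that is too risky.

    instances:   names of the currently healthy instances (hypothetical)
    min_healthy: assumed floor; never go below this many survivors
    rng:         random source, injectable so runs can be reproduced
    """
    candidates = sorted(instances)
    # Refuse to act if losing one more instance would breach the floor.
    if len(candidates) - 1 < min_healthy:
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    group = ["app-1", "app-2", "app-3", "app-4"]
    print(pick_victim(group, min_healthy=3))      # one of the four names
    print(pick_victim(group[:3], min_healthy=3))  # None: no headroom left
```

The floor is the part that matters for small groups: a service built with the failure assumption in mind can set it low, while a legacy stack without redundancy would effectively have to set it to the full instance count.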


JetSetWilly 3799 days ago

Whether they have broken up their apps into microservices doesn't seem to matter to me. That's just a matter of how they have organised their code, whether the actual app is monolithic or microlithic doesn't seem to matter.

They have a focussed business with relatively little variation in how they make money - all their customers simply pay for a streaming service.

Most large companies, certainly banks anyway, have thousands of apps because there are also thousands of different parts of the business making money in their own unique ways, each with its own unique needs.

What works for Netflix therefore can't work for other businesses, because the actual business is much more heterogeneous than that of Netflix and the technology will reflect that whether it is organised in microservices or monolithically - that's totally irrelevant.


JimDabell 3799 days ago

> Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeh in theory it should be able to cope but in reality the scales in most orgs are quite different.

The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist.

What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down.


JetSetWilly 3799 days ago

That's true, but if an app, say, is running on 4 hosts doing some boutique thing for a small unit of 20 traders, then the practical reality is that they might not want Chaos Monkey bringing down 25% of the throughput randomly, and interrupting whatever actual cash money requests are in progress on a host.

It's a lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny, tiny proportion of your customers have to restart their video stream. So what?

But for a small heterogeneous business where what's happening has a much higher cash density, the actual practicalities of randomly killing things in production, and the risk that represents, rather get in the way. Even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.


nvarsj 3799 days ago

I think you're missing the point. The point of something like chaos monkey is to force you to build a system that won't lose money by "bringing down 25% of the throughput".
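As a back-of-the-envelope check on the 4-host example, here is a small sketch of the headroom arithmetic. The capacity and load figures are made-up assumptions, and `survives_host_loss` is a hypothetical helper, not part of any tool.

```python
def survives_host_loss(hosts, per_host_capacity, peak_load, lost=1):
    """Return True if the remaining hosts can still carry peak load.

    hosts:             number of hosts running the app
    per_host_capacity: requests/sec one host can serve (assumed figure)
    peak_load:         requests/sec at the busiest time (assumed figure)
    lost:              how many hosts are taken down at once
    """
    remaining_capacity = (hosts - lost) * per_host_capacity
    return remaining_capacity >= peak_load


if __name__ == "__main__":
    # 4 hosts at 100 req/s each, peak of 300 req/s: losing one host is fine.
    print(survives_host_loss(4, 100, 300))  # True
    # Same hosts with a 350 req/s peak: losing 25% of throughput hurts.
    print(survives_host_loss(4, 100, 350))  # False
```

This only covers steady-state capacity. It says nothing about requests already in flight on the host that gets killed, which is the other half of the objection above.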


JetSetWilly 3798 days ago

My point is that no matter how well engineered your system is, whether you can actually have Chaos Monkey running in production really depends on the risk profile and scale of your business.

As soon as Chaos Monkey causes a service interruption for, say, traders, it would get turned off and whoever had such a bright idea would be fired. But if it causes a service interruption for a tiny proportion of people watching streaming videos - no big deal.

Its proponents just ignore this practical reality and seem politically unaware.


lomnakkus 3799 days ago

> In reality, even the most cloudy of cloud providers will run into this problem at some point.

Actually, wasn't this[0] what did happen several years ago when Amazon Ireland went down for days on end?[1]

[0] TL;DR: Cascading effects of power outage.

[1] http://readwrite.com/2011/08/08/amazons-ireland-services-sti... (didn't read the article, it was just high in the google search results)


skewart 3799 days ago

Interesting. But if, let's say, a data center in London where they have a lot of boxes goes down completely, then they spin up boxes in Frankfurt and Riga to take up the load and reroute traffic. Service is disrupted for some customers for a few minutes. Some people lose some stuff completely because replication wasn't happening perfectly. But the entire site doesn't go down for everyone for two hours.

Are those kinds of failover scenarios frequently messy and risky at the scale of GitHub? Or is it more likely that in the context of a fast-growing company, even at a place as "cloudy" as GitHub, there are bound to be some serious bugs lurking in your system design?
