Hacker News
by JetSetWilly 3793 days ago
The problem is that most environments are very heterogeneous. I evaluated the Chaos Monkey approach for a big bank; the issue is that Netflix has whole data centres full of machines doing pretty much the same thing: streaming and serving.

And the worst that can happen is a customer's stream stops and they have to restart it.

But in most big companies you have thousands of apps that are all doing very different things. Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have Chaos Monkey regularly and randomly bringing down these hosts; it would cause real impact and is risky. Yes, in theory it should be able to cope, but in reality the scales in most orgs are quite different.

That said, GitHub sounds a lot more like the Netflix end of the scale, doing one specific thing at large scale.

2 comments

While Netflix as a company is focused on doing one specific thing at large scale, they're heavily invested in microservices and do actually have "thousands of apps that are all doing very different things".

Chaos Monkey fits when people build and deploy their services with the notion that any particular instance (or dependency) could fail at any given time. It's a tough road to evolve out of a legacy, monolithic stack without much redundancy baked in.

Whether they have broken up their apps into microservices doesn't seem to matter to me. That's just a question of how they have organised their code; whether the actual app is monolithic or microlithic makes no difference.

They have a focussed business with relatively little variation in how they make money - all their customers simply pay for a streaming service.

Most large companies, certainly banks anyway, have thousands of apps because there are also thousands of different parts of the business making money in their own unique ways, each with its own unique needs.

What works for Netflix therefore can't work for other businesses, because the actual business is much more heterogeneous than that of Netflix, and the technology will reflect that whether it is organised in microservices or monolithically - that's totally irrelevant.

> Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have Chaos Monkey regularly and randomly bringing down these hosts; it would cause real impact and is risky. Yes, in theory it should be able to cope, but in reality the scales in most orgs are quite different.

The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist.

What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down.

That's true, but if an app, say, is running on 4 hosts doing some boutique thing for a small unit of 20 traders, then the practical reality is that they might not want Chaos Monkey bringing down 25% of the throughput randomly, and interrupting whatever actual cash money requests are in progress on a host.

It's a lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny, tiny proportion of your customers have to restart their video stream. So what?

But for a small heterogeneous business where what's happening has a much higher cash density, the practicalities of randomly killing things in production, and the risk that represents, rather get in the way. Even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.

I think you're missing the point. The point of something like Chaos Monkey is to force you to build a system that won't lose money by "bringing down 25% of the throughput".

My point is that no matter how well engineered your system is, whether you can actually have Chaos Monkey running in production really depends on the risk profile and scale of your business.

As soon as Chaos Monkey causes a service interruption for, say, traders - it would get turned off and whoever had such a bright idea would be fired. But if it causes a service interruption for a tiny proportion of people watching streaming videos - no big deal.

Its proponents just ignore this practical reality and seem politically unaware.