|
|
|
|
|
by JimDabell
3793 days ago
|
|
> Perhaps a critical app might run on 4 hosts spread across two data centres - you're not going to convince people to have chaos monkey regularly and randomly bringing down these hosts, it would cause real impact and is risky. Yeh in theory it should be able to cope but in reality the scales in most orgs are quite different. The difference between theory and reality is precisely the reason Chaos Monkey and tools like it exist. What you're essentially saying is that in theory, these systems have been designed to be resilient, but in reality, they may not be. If that's the case, then you'd better verify your resiliency, because being resilient in theory but not reality isn't going to help you when your service goes down. |
|
Itsa lot easier to promote that if it is thousands of servers doing something fairly mundane where, worst-case, it not working means a tiny tiny proportion of your customers have to restart their video stream. So what?
But for a small hetereogenous business where what's happening has a much higher cash density, the actual practicalities of randomly killing things in production and the risk that represents rather get in the way, even though in theory you should be able to kill anything in production with minimal impact, you are much less inclined to take that risk when the stakes are higher.