


	by mattew 5569 days ago
	"Currently, Netflix uses a service called 'Chaos Monkey' to simulate service failure. Basically, Chaos Monkey is a service that kills other services. We run this service because we want engineering teams to be used to a constant level of failure in the cloud. Services should automatically recover without any manual intervention. We don't however, simulate what happens when an entire AZ goes down and therefore we haven't engineered our systems to automatically deal with those sorts of failures. Internally we are having discussions about doing that and people are already starting to call this service 'Chaos Gorilla'."

	I am wondering how they could simulate the loss of an AZ. Any ideas?

5 comments

efsavage 5569 days ago

> I am wondering how they could simulate the loss of an AZ. Any ideas?

  Nelson: How many chaos monkeys will there be? 
  Bart Simpson: One at first, but he'll train others.


SpikeGronim 5569 days ago

There are several ways to do it. Kill all the instances. Use a firewall to blackhole all the instances. Use traffic shaping to degrade the latency or packet loss of all the instances.
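
To make the three options concrete, here is a minimal in-memory sketch. Nothing in it talks to a real cloud: the `Instance` fields, the `Mode` names, and the zone labels are all made up for illustration, and a real tool would replace the field updates with provider or firewall calls.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    KILL = "kill"            # terminate every instance in the zone
    BLACKHOLE = "blackhole"  # instances keep running but drop all traffic
    DEGRADE = "degrade"      # instances stay reachable but slow and lossy


@dataclass
class Instance:
    # Hypothetical model of a running instance, not a provider object.
    name: str
    zone: str
    running: bool = True
    reachable: bool = True
    added_latency_ms: int = 0
    packet_loss: float = 0.0


def fail_zone(fleet, zone, mode, latency_ms=500, loss=0.2):
    """Apply one failure mode to every instance in `zone`; return the names hit."""
    hit = []
    for inst in fleet:
        if inst.zone != zone:
            continue
        if mode is Mode.KILL:
            inst.running = False
            inst.reachable = False
        elif mode is Mode.BLACKHOLE:
            inst.reachable = False
        elif mode is Mode.DEGRADE:
            inst.added_latency_ms = latency_ms
            inst.packet_loss = loss
        hit.append(inst.name)
    return hit


fleet = [
    Instance("api-1", "us-east-1a"),
    Instance("api-2", "us-east-1b"),
    Instance("db-1", "us-east-1a"),
]
affected = fail_zone(fleet, "us-east-1a", Mode.BLACKHOLE)
```

The blackhole and degrade modes are arguably the more interesting tests, since the instances still look alive to anything that only checks whether they are running.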


ceejayoz 5569 days ago

> I am wondering how they could simulate the loss of an AZ. Any ideas?

Kill all instances in an AZ?


gbelote 5569 days ago

> I am wondering how they could simulate the loss of an AZ. Any ideas?

They could instrument whatever library they use to interact with AWS and make it report failures or fail to respond to "create new instance"-like commands.
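
A rough sketch of that idea, assuming a wrapper sits between the application and the provider client. `FakeCloudClient`, `create_instance`, and `ZoneUnavailable` are hypothetical names invented here, not part of any real AWS library.

```python
class ZoneUnavailable(Exception):
    """Raised by the wrapper in place of a real provider error."""


class FakeCloudClient:
    """Stand-in for a real provider client; the method name is made up."""

    def create_instance(self, zone):
        return f"instance-in-{zone}"


class FaultInjectingClient:
    """Wraps a client and fails calls that target a 'dead' zone."""

    def __init__(self, inner, dead_zones=()):
        self._inner = inner
        self.dead_zones = set(dead_zones)

    def create_instance(self, zone):
        if zone in self.dead_zones:
            raise ZoneUnavailable(f"simulated outage in {zone}")
        return self._inner.create_instance(zone)


def launch_with_failover(client, zones):
    """Try each zone in order; return the first instance that launches."""
    for zone in zones:
        try:
            return client.create_instance(zone)
        except ZoneUnavailable:
            continue
    raise RuntimeError("no zone available")


client = FaultInjectingClient(FakeCloudClient(), dead_zones={"us-east-1a"})
result = launch_with_failover(client, ["us-east-1a", "us-east-1b"])
```

This only exercises the control-plane path (launching replacements). It says nothing about traffic already flowing to instances in the failed zone.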


RyanKearney 5569 days ago

Perhaps they have groups set up in their "Chaos Monkey" tool? Something like a "take down ALL services in GROUP B" command?
