| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dyanacek 1709 days ago

Good point around avoiding the call in the first place. This is a very tricky topic, I’ve found. Things that try to guess the nuanced health of a dependency can lead to outages when they guess wrong. These circuit breakers are helpful if they’re right, but harmful if they’re wrong.

For example, say a service is backed by a partitioned cache cluster, where the data is hashed to a particular cache node. Now let’s say one node has a problem, causing requests to data that lives on that node to fail, but others to succeed. If a client is making requests for data that happens to live on all nodes (the client doesn’t know about these nodes by the way, it’s just an implementation detail of the service) and sees an increased error rate, should it start failing some requests? It could take a single partition outage and increase the scope of impact into a full outage.

Anyway I’ve been meaning to write an Amazon Builders’ Library article on this topic, or to convince someone else to do it (looking at you, Marc Brooker!)