by benjchristensen 4952 days ago
That's an interesting thought and probably something that would make a good addition to the library.

It could be part of the default library or perhaps a custom strategy for the circuit breaker (once I finish abstracting it so it can be customized via a plugin: https://github.com/Netflix/Hystrix/issues/9).

At the scale Netflix clusters operate, they basically get this randomness already because circuits open/close independently on each server (no cluster state or decision making).

In this screenshot https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... you'll see that in a cluster of 234 servers about 1/3 of them are tripped and the rest are still letting traffic through.

Thus the cluster naturally levels out at however much traffic the degraded backend can handle, as circuits flip open/closed in a rolling manner across the instances.
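To make the "no cluster state" point concrete, here is a rough toy simulation in Python. It is not Hystrix code: the class name, the thresholds, the tick-based clock and the fixed-capacity backend model are all assumptions made up for illustration. Each breaker decides only from the results its own server has seen.

```python
import random


class LocalCircuitBreaker:
    """Hypothetical breaker that decides only from what this one server has seen."""

    def __init__(self, error_threshold=0.5, min_requests=20, sleep_window=5.0):
        self.error_threshold = error_threshold  # failure ratio that trips the circuit
        self.min_requests = min_requests        # don't trip on too small a sample
        self.sleep_window = sleep_window        # time to stay open before a trial request
        self.successes = 0
        self.failures = 0
        self.opened_at = None                   # None means the circuit is closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True
        # Open: reject until the sleep window has elapsed, then allow a trial.
        return now - self.opened_at >= self.sleep_window

    def record(self, ok, now):
        if self.opened_at is not None:
            if ok:
                self.opened_at = None           # trial succeeded: close and reset counts
                self.successes = self.failures = 0
            else:
                self.opened_at = now            # trial failed: restart the sleep window
            return
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        # Simplification: cumulative counts rather than a rolling window.
        if total >= self.min_requests and self.failures / total >= self.error_threshold:
            self.opened_at = now


def simulate(servers=234, ticks=300, capacity=150, seed=1):
    """Return the fraction of open circuits after each tick (toy model)."""
    rng = random.Random(seed)
    breakers = [LocalCircuitBreaker() for _ in range(servers)]
    open_fraction = []
    for tick in range(ticks):
        # Each server whose circuit allows it sends one request this tick.
        senders = [b for b in breakers if b.allow_request(tick)]
        rng.shuffle(senders)
        # Assumed backend model: only the first `capacity` requests succeed.
        for i, b in enumerate(senders):
            b.record(i < capacity, tick)
        open_fraction.append(sum(b.opened_at is not None for b in breakers) / servers)
    return open_fraction
```

Calling `simulate()` and plotting the returned list shows how the share of open circuits moves over time under these made-up parameters. It is not tuned to reproduce the 1/3 figure from the screenshot; the only point is that nothing in the loop shares state between breakers.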

Also, doing this makes sense even when a dependency doesn't have a useful redundancy and must fail fast and return an error.

It is far better to fail fast and let the end client (such as a browser, iPad, PS3, Xbox, etc.) retry and hopefully reach the 2/3 of instances that are still able to respond than to let the system queue requests (or go into server-side retry loops and DDoS the backend) until it fails and lets nothing through.

Obviously we prefer to have valid fallbacks, but many dependencies don't have one, and in those cases that is what we do: fail fast (timeout, reject, short-circuit) on instances that can't serve the request and let the clients retry. In a large cluster of hundreds of instances, a retry almost always gets a different route through the instances of the API and backend dependencies.
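As a back-of-the-envelope check on why client retries work here, the arithmetic can be sketched as below. It assumes each attempt lands on an instance chosen independently and uniformly at random, which real load balancers only approximate, and the function name is made up for illustration.

```python
def success_probability(tripped_fraction, attempts):
    """Chance that at least one of `attempts` tries reaches a non-tripped instance.

    Assumes each attempt is routed independently and uniformly at random.
    """
    return 1 - tripped_fraction ** attempts


# With about 1/3 of instances tripped, as in the screenshot discussed above:
for attempts in (1, 2, 3):
    print(attempts, round(success_probability(1 / 3, attempts), 3))
# prints:
# 1 0.667
# 2 0.889
# 3 0.963
```

So under those assumptions, even one or two client-side retries get most requests through to a healthy instance.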

@benjchristensen

1 comment

thanks for posting, those insights/experiences with this architecture helped me understand some of the design decisions.

also, huge thanks to you and your team (and your employer!) for releasing an amazing volume of production-quality open source projects this year.