Hacker News new | ask | show | jobs
by chris_va 4615 days ago
I'm glad to see more cluster management software getting open sourced, and this is sort of on the right track.

However, looking at the design, this still has a long way to go. There are a lot of failure modes you guys haven't encountered yet, which will result in a few design tweaks. For example, what happens if your health checkers decide to start reporting garbage data (e.g. maybe they are too overloaded to properly perform health checks)? Or when you have a query of death being issued? Also, things like traffic sloshing can very quickly build resonant failures in a system like this.

(Source: many years working on Google infrastructure, including causing outages related to load balancing code)

1 comments

What is traffic sloshing?

Good point on garbage data reporting; we do basic validation in synapse, here: https://github.com/airbnb/synapse/blob/master/lib/synapse/se...

We could probably do more there to ensure valid names, IPs and ports (matching against a regex should do it). Also, because of the built-in health checking in haproxy, just the presence of some invalid name in the list of machines doesn't mean that we're going to try to start sending traffic there.

Traffic sloshing (basic overview): Say you have a pool of machines for a service (traditionally this problem is multi-regional, though it technically can happen at any scale). For some reason (machine restart, query of death, reloading, etc) a subset of your backends become unhealthy. This gets automatically detected by your framework, and the traffic gets routed to different machines. Now, you may have under-provisioned your backends (or you have a query of death), so this concentration of traffic on a smaller number of machines causes them to choke. You get a seesaw effect of traffic going around to the different backends, taking them out like a concentrated firehose. These failures all get detected by your framework, which routes traffic away from the backends. What you really wanted was a steady stream to all backends. A lot of load balancing systems have this failure mode. The good ones can detect it and converge back to a good steady state. The naive ones just keep the firehose spinning. It is harder to fall into this trap with simple binary health checking. It becomes a lot easier when you do traffic allocation by latency, or have more complicated health criteria that is easier to fail.

On the health checking/garbage data front: It's usually more of a problem when something misreports a bad backend, rather than misreporting a good backend. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaved health-checker can suddenly unload all of your services.

That happened at work a few weeks ago---basically, one of our components did a health check of of DNS when the health check DNS record was mistakenly deleted. Because of the way the health check code was written, that component thought the DNS servers were down and shut down. That in turn, shut down other components that were health checking that component. Boom! Instant virtual dominoes.