Hacker News new | ask | show | jobs
by chris_va 4615 days ago
Traffic sloshing (basic overview): Say you have a pool of machines for a service (traditionally this problem is multi-regional, though it technically can happen at any scale). For some reason (machine restart, query of death, reloading, etc) a subset of your backends become unhealthy. This gets automatically detected by your framework, and the traffic gets routed to different machines. Now, you may have under-provisioned your backends (or you have a query of death), so this concentration of traffic on a smaller number of machines causes them to choke. You get a seesaw effect of traffic going around to the different backends, taking them out like a concentrated firehose. These failures all get detected by your framework, which routes traffic away from the backends. What you really wanted was a steady stream to all backends. A lot of load balancing systems have this failure mode. The good ones can detect it and converge back to a good steady state. The naive ones just keep the firehose spinning. It is harder to fall into this trap with simple binary health checking. It becomes a lot easier when you do traffic allocation by latency, or have more complicated health criteria that is easier to fail.

On the health checking/garbage data front: It's usually more of a problem when something misreports a bad backend, rather than misreporting a good backend. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaved health-checker can suddenly unload all of your services.

1 comments

That happened at work a few weeks ago---basically, one of our components did a health check of of DNS when the health check DNS record was mistakenly deleted. Because of the way the health check code was written, that component thought the DNS servers were down and shut down. That in turn, shut down other components that were health checking that component. Boom! Instant virtual dominoes.