
Traffic sloshing (basic overview): Say you have a pool of machines for a service (traditionally this problem is multi-regional, though it can technically happen at any scale). For some reason (machine restart, query of death, reloading, etc.) a subset of your backends becomes unhealthy. Your framework detects this automatically and routes the traffic to different machines.

Now, you may have under-provisioned your backends (or you have a query of death), so concentrating the traffic on a smaller number of machines causes them to choke. You get a seesaw effect: traffic moves around to the different backends, taking them out like a concentrated firehose. Your framework detects each of these failures and routes traffic away from those backends too. What you really wanted was a steady stream to all backends.

A lot of load balancing systems have this failure mode. The good ones can detect it and converge back to a good steady state. The naive ones just keep the firehose spinning. It is harder to fall into this trap with simple binary health checking. It becomes a lot easier when you allocate traffic by latency, or have more complicated health criteria that are easier to fail.
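
To make the seesaw concrete, here is a toy simulation. It is only a sketch under assumptions of my own, not how any real load balancer works: time moves in discrete ticks, load is spread evenly over the healthy backends, a backend that gets more than `capacity` is ejected for `recovery_ticks` ticks, and traffic is simply dropped when nothing is healthy. The function name and all parameters are made up for illustration.

```python
def simulate(num_backends, capacity, total_load, ticks,
             recovery_ticks=2, initially_down=(0,)):
    """Return the number of healthy backends seen at each tick (toy model)."""
    # down_for[i] = ticks remaining before backend i is readmitted
    down_for = [0] * num_backends
    for i in initially_down:
        down_for[i] = recovery_ticks

    history = []
    for _ in range(ticks):
        healthy = [i for i in range(num_backends) if down_for[i] == 0]
        history.append(len(healthy))

        # Backends that are already out get one tick closer to readmission.
        down_for = [max(0, d - 1) for d in down_for]

        if not healthy:
            continue  # nothing to route to; this toy model just drops traffic

        # Naive policy: split the load evenly over whatever is healthy.
        per_backend = total_load / len(healthy)
        if per_backend > capacity:
            # Every backend that received traffic is overloaded and ejected.
            for i in healthy:
                down_for[i] = recovery_ticks
    return history


if __name__ == "__main__":
    # Under-provisioned: losing 1 of 4 backends overloads the other 3.
    print(simulate(num_backends=4, capacity=30, total_load=100, ticks=6))
    # With headroom: the pool rides out the same failure and recovers.
    print(simulate(num_backends=4, capacity=40, total_load=100, ticks=4))
```

In the first run the healthy count cycles through 3, 0, 1 and never gets back to 4: the firehose keeps spinning. In the second run the extra headroom lets the pool go 3, 3, 4, 4 and settle.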

On the health checking/garbage data front: It's usually more of a problem when something wrongly reports a backend as bad than when it wrongly reports a backend as good. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaving health checker can suddenly unload all of your services.
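
One common way to blunt a single bad health checker is to require agreement between several checkers and to cap how much of the pool can be ejected at once. The sketch below is hypothetical (the function name, the `reports` shape, `quorum` and `max_eject_fraction` are all my own assumptions, not the API of haproxy or any other tool), but it shows the idea:

```python
from collections import Counter


def backends_to_eject(all_backends, reports, quorum, max_eject_fraction):
    """Pick backends to eject, given per-checker unhealthy reports.

    all_backends: list of backend names
    reports: dict of checker id -> set of backends that checker calls unhealthy
    quorum: minimum number of checkers that must agree before ejecting
    max_eject_fraction: never eject more than this share of the pool
    """
    votes = Counter()
    for unhealthy in reports.values():
        votes.update(unhealthy)  # each checker contributes one vote per backend

    # Only backends that enough checkers agree on are candidates.
    candidates = [b for b in all_backends if votes[b] >= quorum]
    # Eject the most-agreed-upon backends first (sort is stable for ties).
    candidates.sort(key=lambda b: -votes[b])

    # Hard cap, so even unanimous garbage cannot unload the whole service.
    limit = int(len(all_backends) * max_eject_fraction)
    return set(candidates[:limit])


if __name__ == "__main__":
    pool = ["a", "b", "c", "d"]
    # Checker c1 has gone haywire and reports everything as unhealthy.
    reports = {"c1": {"a", "b", "c", "d"}, "c2": {"a"}, "c3": {"a"}}
    print(backends_to_eject(pool, reports, quorum=2, max_eject_fraction=0.5))
```

Here only `a` is ejected, because it is the only backend that at least two checkers agree on. The fraction cap is the second line of defence for the case where the checkers share a bug and all report garbage together.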