Hacker News new | ask | show | jobs
by igor47 4618 days ago
What is traffic sloshing?

Good point on garbage data reporting; we do basic validation in synapse, here: https://github.com/airbnb/synapse/blob/master/lib/synapse/se...

We could probably do more there to ensure valid names, IPs and ports (matching against a regex should do it). Also, because of the built-in health checking in haproxy, just the presence of some invalid name in the list of machines doesn't mean that we're going to try to start sending traffic there.

1 comments

Traffic sloshing (basic overview): Say you have a pool of machines for a service (traditionally this problem is multi-regional, though it technically can happen at any scale). For some reason (machine restart, query of death, reloading, etc) a subset of your backends become unhealthy. This gets automatically detected by your framework, and the traffic gets routed to different machines. Now, you may have under-provisioned your backends (or you have a query of death), so this concentration of traffic on a smaller number of machines causes them to choke. You get a seesaw effect of traffic going around to the different backends, taking them out like a concentrated firehose. These failures all get detected by your framework, which routes traffic away from the backends. What you really wanted was a steady stream to all backends. A lot of load balancing systems have this failure mode. The good ones can detect it and converge back to a good steady state. The naive ones just keep the firehose spinning. It is harder to fall into this trap with simple binary health checking. It becomes a lot easier when you do traffic allocation by latency, or have more complicated health criteria that is easier to fail.

On the health checking/garbage data front: It's usually more of a problem when something misreports a bad backend, rather than misreporting a good backend. The latter is easy to catch (as you mention, haproxy does it). The former is hard because one misbehaved health-checker can suddenly unload all of your services.

That happened at work a few weeks ago---basically, one of our components did a health check of of DNS when the health check DNS record was mistakenly deleted. Because of the way the health check code was written, that component thought the DNS servers were down and shut down. That in turn, shut down other components that were health checking that component. Boom! Instant virtual dominoes.