by chris_va
4615 days ago
I'm glad to see more cluster management software getting open sourced, and this is sort of on the right track. However, looking at the design, this still has a long way to go. There are a lot of failure modes you guys haven't encountered yet, which will result in a few design tweaks. For example, what happens if your health checkers decide to start reporting garbage data (e.g. maybe they are too overloaded to properly perform health checks)? Or when you have a query of death being issued? Also, things like traffic sloshing can very quickly build resonant failures in a system like this. (Source: many years working on Google infrastructure, including causing outages related to load balancing code.)

Good point on garbage data reporting; we do basic validation in synapse, here: https://github.com/airbnb/synapse/blob/master/lib/synapse/se...
We could probably do more there to ensure valid names, IPs and ports (matching against a regex should do it). Also, because of the built-in health checking in haproxy, just the presence of some invalid name in the list of machines doesn't mean that we're going to try to start sending traffic there.
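
To make the "ensure valid names, IPs and ports" idea concrete, here is a rough sketch of that kind of check. It is in Python for brevity and is not synapse's actual code: the `name`/`host`/`port` keys, the function names, and the exact patterns are all assumptions for illustration. IPs are parsed with the standard `ipaddress` module instead of a regex, since a hand-rolled IP regex is easy to get wrong.

```python
import ipaddress
import re

# Hypothetical patterns, not taken from synapse.
# One DNS label: letters, digits, hyphens, no leading/trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")
# Backend name: a conservative character set chosen for this sketch.
_NAME = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_host(host):
    """Accept a literal IP address or a plausible DNS hostname."""
    if not isinstance(host, str) or not host or len(host) > 253:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    # Digits and dots only, but not a valid IP (e.g. "999.1.1.1"): reject.
    if re.fullmatch(r"[0-9.]+", host):
        return False
    return all(_LABEL.fullmatch(label) for label in host.split("."))


def is_valid_port(port):
    """Accept integers in the TCP port range; bools are rejected."""
    return isinstance(port, int) and not isinstance(port, bool) and 1 <= port <= 65535


def is_valid_backend(entry):
    """Check one discovered backend, given as a dict (assumed shape)."""
    if not isinstance(entry, dict):
        return False
    name = entry.get("name")
    return (
        isinstance(name, str)
        and _NAME.fullmatch(name) is not None
        and is_valid_host(entry.get("host"))
        and is_valid_port(entry.get("port"))
    )


def filter_backends(entries):
    """Drop entries that fail validation instead of passing them on."""
    return [e for e in entries if is_valid_backend(e)]


if __name__ == "__main__":
    discovered = [
        {"name": "web-1", "host": "10.0.0.5", "port": 8080},
        {"name": "web-2", "host": "999.1.1.1", "port": 8080},
        {"name": "web 3", "host": "web3.example.com", "port": 70000},
    ]
    print(filter_backends(discovered))
```

This only catches malformed entries. A well-formed entry that points at the wrong machine still passes, which is where the haproxy health checks come in.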