by armon
4619 days ago
I'd highly recommend taking a look at this page: http://www.serfdom.io/docs/internals/gossip.html. One of the great attributes of the gossip protocol is that it is very robust to intermittent network failures. Under low packet loss (<5%), the rate of false positives should be very low. This is due to a few techniques, one of which is indirect probing, and another is a novel "suspicion" mechanism. In the case of a network partition, the parts of the cluster can run in isolation and will recover when the partition heals. If you are interested, the paper referenced there ("SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol") is the foundation of Serf. In the paper you can find more details about the behavior of the cluster, false positive rates under packet loss, and partition handling. tl;dr the system is in fact designed with network errors in mind, as opposed to handling them as an afterthought.
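
To give a rough feel for why indirect probing helps, here is a back-of-the-envelope sketch in Python. It is a toy model, not Serf's actual code: it assumes every packet is dropped independently with the same probability, that a direct probe needs 2 packets to arrive (ping, ack), and that an indirect probe through one relay needs 4 (ping-req, ping, ack, forwarded ack). The function name and parameters are made up for illustration, and the suspicion mechanism is not modelled at all.

```python
def probe_round_failure(loss: float, relays: int) -> float:
    """Chance that a healthy node fails one full probe round.

    Toy model (an assumption, not Serf's implementation):
    - each packet is lost independently with probability `loss`
    - the direct probe and every indirect probe must all fail
    """
    direct_ok = (1 - loss) ** 2    # ping + ack both arrive
    indirect_ok = (1 - loss) ** 4  # ping-req, ping, ack, forwarded ack
    return (1 - direct_ok) * (1 - indirect_ok) ** relays


if __name__ == "__main__":
    # Compare a direct-only probe with probes backed by 1 or 3 relays.
    for relays in (0, 1, 3):
        rate = probe_round_failure(loss=0.05, relays=relays)
        print(f"relays={relays}: failed round probability {rate:.6f}")
```

Even in this simplified model, each extra relay multiplies the chance of a failed round by a factor well below 1, so a healthy node is much less likely to be suspected at all. In the real protocol a suspected node also gets time to refute the suspicion before being declared dead, which should push false positives lower still.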