by xyzzy_plugh
1518 days ago
> The failures should be relatively rare; when I say relatively I mean on the level of natural node failure.

And exactly how rare do you believe this to be? In my experience, at a scale of hundreds to thousands of nodes, node failures happen monthly to weekly, if not daily. Generally speaking, stability is a normal distribution. Young, new instances experience failure rates similar to those of old instances. If you have any sort of maximum node lifetime (for example, a week) or scale dynamically on a daily basis, then you'll see a lot of failures.
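To put rough numbers on that, here is a back-of-the-envelope sketch. The per-node MTBF of 1000 days is a made-up figure for illustration, not a measurement, and the model assumes nodes fail independently at a constant rate; the function names are hypothetical too.

```python
# Rough fleet failure arithmetic. ASSUMPTIONS: nodes fail independently,
# at a constant rate, and the MTBF value below is hypothetical.

def expected_failures_per_day(node_count, mtbf_days):
    """Expected number of node failures per day across the whole fleet."""
    return node_count / mtbf_days


def days_between_failures(node_count, mtbf_days):
    """Average number of days between failures somewhere in the fleet."""
    return mtbf_days / node_count


if __name__ == "__main__":
    ASSUMED_MTBF_DAYS = 1000.0  # hypothetical: roughly one failure per node every ~3 years
    for nodes in (100, 1000, 5000):
        gap = days_between_failures(nodes, ASSUMED_MTBF_DAYS)
        print(f"{nodes} nodes: one failure every {gap:g} days on average")
```

Under those assumptions, 100 nodes see a failure about every 10 days and 1000 nodes about one per day, before counting anything caused by a maximum node lifetime or daily scaling.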
|
But that generally mirrors my experience that automatic failover for stable software tends to cause more issues than it solves. A good PostgreSQL server (i.e. one with redundant hardware and software) is so unlikely to fail that false failure detection and cascading issues from automatic failover are more likely than a real failure where the failover actually helps.