by takeda
2216 days ago
Can Patroni tell whether the master node is unresponsive because it is busy vs. dead? GitHub (I believe) had a few outages that caused data loss because their auto failover mechanism kicked in when it shouldn't have. I would actually be interested in aphyr's analysis of Patroni and other distributed add-ons to PostgreSQL.
The only question is how soon you are going to page humans: after the automated mechanism has flipped your master 2-3 times but the cluster still hasn't made progress (nothing coming out of the master, or it locks up again after a few minutes), or right after some other automated mechanism detects that there's a problem.
Whatever automation you have in place, it has advantages and disadvantages. In the GitHub case - I suppose - they determined post-mortem that it would have been better to just let the master chug through the incoming onslaught of queries instead of failing over, and over, and over. (But of course this seems like a trivial problem in any auto failover setup, so I suspect there's more to the story.)
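
To make the "page after 2-3 flips without progress" option concrete, here is a rough sketch of what such an escalation rule might look like. It is not how Patroni or GitHub's tooling works; the class name, the thresholds, and the idea of resetting on "progress" are all assumptions made up for illustration.

```python
from collections import deque


class FailoverFlapDetector:
    """Hypothetical helper: says when to stop trusting auto failover and page a human."""

    def __init__(self, max_failovers=3, window_seconds=600.0):
        # Assumed policy: this many failovers inside the window means "flapping".
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent failovers, oldest first

    def _expire(self, now):
        # Forget failovers that are older than the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()

    def record_failover(self, now):
        self._events.append(now)
        self._expire(now)

    def record_progress(self):
        # The cluster is serving queries again, so the earlier flips no longer count.
        self._events.clear()

    def should_page(self, now):
        self._expire(now)
        return len(self._events) >= self.max_failovers


detector = FailoverFlapDetector(max_failovers=3, window_seconds=600.0)
for t in (0.0, 100.0, 200.0):
    detector.record_failover(t)
if detector.should_page(200.0):
    print("master flipped 3 times without progress, paging a human")
```

The alternative in the comment above (page as soon as any detector fires) would amount to setting `max_failovers=1`; the trade-off is more pages versus a longer stretch of unattended flapping.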