With data sharded across three nodes, why can't a replication factor of two handle one node's failure? Why does the replication factor need to be three?
In our current architecture, we perform failover by selecting a new primary from among the existing replicas for a given shard. We do not, however, add new servers to the replica list during a failover.
Therefore, if you have a table with two replicas per shard and one node fails, each shard that had a replica on that node is left with only a single replica.
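Currently we do not automatically fail over in that case. We always require a majority of a shard's replicas to be available before failing over the primary of that shard. With two replicas a majority is two, so one surviving replica is not enough; with three replicas a majority is also two, so the shard can lose one node and still fail over.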
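To make the arithmetic concrete, here is a minimal sketch of the majority rule described above. The function names are made up for illustration and are not part of any database API; the only assumption is that "majority" means strictly more than half of the replicas configured for the shard.

```python
def majority(configured_replicas: int) -> int:
    """Smallest replica count that is strictly more than half."""
    return configured_replicas // 2 + 1


def can_fail_over(configured_replicas: int, live_replicas: int) -> bool:
    """Hypothetical check: a shard may elect a new primary only if a
    majority of its configured replicas are still reachable."""
    return live_replicas >= majority(configured_replicas)


# Compare replication factors 2 and 3 after losing exactly one node.
for configured in (2, 3):
    live = configured - 1
    print(f"replicas={configured} live={live} "
          f"needed={majority(configured)} "
          f"failover={can_fail_over(configured, live)}")
```

With two replicas the single survivor falls short of the required two, while with three replicas the two survivors meet it.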
I believe we could relax this constraint in this specific case and allow failover even though only a single replica is left, as long as we still have a majority for the table overall. (I'm going to double-check this with one of our core developers...)
Even if we did perform the failover, though, that would not restore write availability for the table, since there wouldn't be a majority of replicas to acknowledge a write (unless you set write_acks to "single").
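The write side can be sketched the same way. This is only an illustration of the rule as stated here, not the server's actual logic: the function name is hypothetical, and it assumes write_acks takes the values "majority" and "single" as mentioned above.

```python
def can_acknowledge_write(configured_replicas: int,
                          live_replicas: int,
                          write_acks: str = "majority") -> bool:
    """Hypothetical check for whether a write can be acknowledged."""
    if write_acks == "single":
        # One live replica is enough to acknowledge the write.
        return live_replicas >= 1
    if write_acks == "majority":
        # More than half of the configured replicas must acknowledge.
        return live_replicas >= configured_replicas // 2 + 1
    raise ValueError(f"unknown write_acks setting: {write_acks!r}")


# Two replicas configured, one node down:
print(can_acknowledge_write(2, 1))            # default "majority" setting
print(can_acknowledge_write(2, 1, "single"))  # relaxed acknowledgement
```

Under these assumptions, a two-replica shard that has lost one node cannot accept majority-acknowledged writes even after a failover, but it can with write_acks set to "single".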
Failing over could still be useful for restoring availability of up-to-date read queries.
We might add support for automatic failover in this scenario in the future.