by hnjst
1789 days ago
Completely agree with the parent: day-to-day operations are often slightly more complex than making sure your server runs your app properly. I'd argue that our services increasingly depend on others (in both directions; dependencies multiply and become more and more critical). It's also by interacting more with external entities that they deliver more value.

> The best solution I can come up with is a synchronously-replicated append-only log store which is utilized in a primary/sync-witness/async-witness/... configuration. The first tier of resilience would be synchronous and provided by a set of witness nodes which must ack as a majority to progress primary. These nodes would ideally be within 1-2ms of the primary. The async witnesses could be in orbit and/or on mars. These are more about extreme geological disaster recovery. The witness nodes would also use a separate consensus protocol to decide when the primary needs to be taken down and replaced with a sync (or god forbid async) witness. They would be able to elect an emergency leader separate from the primary who would be authorized to stop the bad primary in the hypervisor, and edit any relevant DNS records to ensure traffic stops hitting the bad system.

This part is what I felt deserved a counterpoint, though. Consensus is indeed at the core of the issue once you want distributed fault tolerance. However, I think you'll quickly hit two problems with your approach.

First, witnesses within 1-2ms of the primary are necessarily physically close to it, so I fear the "first tier of resilience" will come with highly correlated failures.

Second, with the "second tier" being much farther away, keeping it in consensus implies harsh trade-offs. If you use a synchronous consensus protocol, you'll drastically slow down the "first tier" (assuming you want consistency). If you go for asynchronous replication (which is not consensus, and this matters...), then the second tier can't really intervene in leader election or failover without risking a partition on a false positive (and if you try to be conservative there, your RPO will suffer).

If you're into these (fundamental) issues, I'd recommend Leslie Lamport's work (e.g. https://www.college-de-france.fr/site/en-martin-abadi/semina... or http://www.lamport.org/), the paper establishing a disappointing impossibility result: https://dl.acm.org/doi/abs/10.1145/3149.214121, and its generalization (https://dl.acm.org/doi/abs/10.1145/167088.167119).
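To make the latency side of that trade-off concrete, here is a rough sketch of how long a primary waits when it needs acks from a strict majority of its synchronous witnesses. Everything in it is an assumption for illustration: the function names, the round-trip times, and the convention that an unreachable witness has an infinite round trip. It models only the ack wait, not a real consensus protocol.

```python
import math
from typing import Sequence


def majority(n: int) -> int:
    """Smallest number of acks that is a strict majority of n witnesses."""
    return n // 2 + 1


def commit_latency_ms(witness_rtts_ms: Sequence[float]) -> float:
    """Time until a majority of witnesses has acked a write.

    Each entry is the (hypothetical) round-trip time to one synchronous
    witness; math.inf stands for a witness that is down or unreachable.
    The primary is unblocked by the k-th fastest ack, where k is the
    majority size, so the result is the k-th smallest round trip.
    """
    if not witness_rtts_ms:
        raise ValueError("at least one witness is required")
    k = majority(len(witness_rtts_ms))
    return sorted(witness_rtts_ms)[k - 1]


# Three nearby witnesses: the second-fastest ack unblocks the primary.
nearby = commit_latency_ms([1.0, 1.5, 2.0])

# One nearby witness and two distant ones in the synchronous set:
# every commit now waits on a long-haul round trip.
mixed = commit_latency_ms([1.0, 150.0, 160.0])

# A correlated failure takes out two of the three nearby witnesses:
# no majority can ack, so the primary cannot make progress.
stalled = commit_latency_ms([1.0, math.inf, math.inf])
```

With these made-up numbers, `nearby` is 1.5 ms, `mixed` is 150 ms, and `stalled` is infinite: keeping the quorum close keeps commits fast but exposes it to correlated failures, while pulling distant nodes into the synchronous set puts their round trip on every write.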
|
Where your datacenters are geographically located is usually a big first step in even starting these types of conversations. Whether sync replication is feasible or a liability may come down to the geography of a region and the statistical likelihood of certain disasters hitting multiple sites simultaneously.
Some customers can't ever afford to lose a single transaction no matter what; some just need it to be reasonably stable but incredibly fast (e.g. banking vs. gaming).
Will definitely be spending some time reviewing Lamport's work again. Establishing the notion of stable time between all participants is a fascinating way to solve a lot of problems in distributed systems.