|
I am implementing a Multi-Paxos variant called Viewstamped Replication (http://pmg.csail.mit.edu/papers/vr-revisited.pdf) for TigerBeetle (https://github.com/coilhq/tigerbeetle) and keeping notes along the way to help other implementors down the line. Some things I'm finding so far:

* As developers, we're used to thinking of services in terms of streaming TCP connections and RPCs. You send a request on a connection and get a response back on the same connection. However, distributed consensus algorithms (or at least their authors) like to think and write in terms of messages, message passing and the classic Actor pattern. For example, it's not uncommon for a consensus client to send a message to a leader but then get the ACK back from another server, a subsequently elected leader. That's at odds with the networking protocol we're used to. It's not always easy to shoehorn a consensus protocol onto a system that already has a TCP-oriented design. Embrace message passing and multi-path routing.

* We're familiar with Jepsen. The network fault model is front of mind (dropped/delayed/replayed/corrupted messages, partitions, asymmetrical network topologies and performance). We're far less wary of the storage fault model: latent sector errors (EIO), silent bit rot, misdirected writes (writes written by firmware to the wrong sector), corrupt file system metadata (wrong journal file size, disappearing critical files), kernel page cache coherency issues (marking dirty pages clean after an fsync EIO), and confusing journal corruption for a torn write after power failure.

* We underestimate the sheer bulk of the code we need to write to implement all the components of a practical consensus protocol correctly (a consensus replica to run the protocol at each node, a write-ahead journal for storage, a message bus for in-process or remote messaging, a state machine for service upcalls). The consensus protocol invariants are tough but limited; the amount of code that has to be written for all these components is brutal, and there are so many pitfalls along the way. For example, when you read from your write-ahead journal at startup and you find a checksum mismatch, do you assume this is because of a torn write after power failure, as ZooKeeper and LogCabin do? What if it was actually just bit rot halfway through your log? How would you change your write-ahead journal to disentangle these?

* We tend to think of the correctness of any given consensus function as binary, and fail to appreciate the broad spectrum of safety requirements across the specific components of the consensus algorithm. In other words, we don't always take fully to heart that some consensus messages are more critical than others. For example, we might resend an ACK to the leader if we detect (via op number) that we've already logged the prepare for that op number. However, most implementations I've seen neglect to assert and double-check that we really do have exactly what the leader is asking us to persist before we ACK. It's a simple verification check to compare checksums before skipping the journal write and acking the duplicate prepare, and yet we don't.

* Another example: when we count messages from peers to establish quorum during leader election, we might count these messages without applying all the assertions we can think of on them. For example, are we asserting that all the messages we're counting are actually for the same leader election term? Or did we simply assume that we reset the array of messages being counted during the appropriate state transition sometime back in the past? The former is a much stronger guarantee, because it keeps you from double-counting stale leader election messages from past election phases, especially if these were successive (e.g. multiple rounds of elections because of split votes with no successful outcome). We should rather assume that the array we store these messages in, and that we're counting, could contain anything, and then assert that it contains exactly what we expect.

* Our intuition around fault tolerance might suggest that local storage faults cannot propagate to destroy global consensus. Yet they do (https://www.youtube.com/watch?v=fDY6Wi0GcPs). We need to be really careful how we repair local faults so that we do so correctly in the context of the global consensus protocol.

* Finally, I think what also really helps is to have a completely deterministic consensus protocol Replica abstraction that you initialize with an abstract Message Bus, Journal and State Machine instance. This Replica instance can send messages to in-process or remote Replica instances, and has on_message() handlers for the various protocol messages that either change state and/or send messages but can never fail (i.e. no error union return type), because fallible handlers amplify the dimensionality of the code paths. For timeouts, don't use the system clock because it's not deterministic. Instead, use a Timeout abstraction that you step through by calling tick() on the Replica. With these components in place, you can build an automated random fuzzing test to simulate your distributed network and local storage fault models and test invariants along the way, outputting a deterministic seed to reproduce any random failures easily. |
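To make the duplicate-prepare point above concrete, here is a minimal Python sketch of comparing checksums before skipping the journal write and acking. Everything in it is hypothetical and for illustration only: the `Prepare` and `Journal` types, the `on_prepare` function, the return strings, and the use of SHA-256 as the checksum are assumptions, not TigerBeetle's or the VR paper's actual design.

```python
import hashlib
from dataclasses import dataclass


def checksum_of(body: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever checksum the real system uses.
    return hashlib.sha256(body).hexdigest()


@dataclass(frozen=True)
class Prepare:
    op: int
    body: bytes
    checksum: str


def make_prepare(op: int, body: bytes) -> Prepare:
    return Prepare(op=op, body=body, checksum=checksum_of(body))


class Journal:
    """Hypothetical in-memory journal, keyed by op number."""

    def __init__(self) -> None:
        self.entries: dict[int, Prepare] = {}
        self.writes = 0

    def write(self, prepare: Prepare) -> None:
        self.entries[prepare.op] = prepare
        self.writes += 1


def on_prepare(journal: Journal, prepare: Prepare) -> str:
    # Never trust the message: recompute its checksum first.
    if checksum_of(prepare.body) != prepare.checksum:
        return "drop"
    existing = journal.entries.get(prepare.op)
    if existing is None:
        journal.write(prepare)
        return "ack"
    # Same op number already logged: only skip the write and re-ack if we
    # hold exactly what the leader is asking us to persist.
    if existing.checksum == prepare.checksum:
        return "ack"
    # Same op number, different contents: do not ack. What happens next
    # (repair, replacement in a newer view) is protocol-specific and is
    # deliberately left out of this sketch.
    return "mismatch"
```

The key property is that an ACK for an op number is only ever sent when the journal holds a prepare with the same checksum, rather than merely a prepare with the same op number.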
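The quorum-counting point above can be sketched the same way. The idea is to treat the array of election messages as if it could contain anything, and assert on every entry at the moment of counting instead of relying on a reset that happened in some earlier state transition. The `ElectionMessage` type, the `has_quorum` function and the one-slot-per-replica layout are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElectionMessage:
    term: int      # the leader election term this message belongs to
    replica: int   # index of the replica that sent it


def has_quorum(
    messages: list[Optional[ElectionMessage]],
    term: int,
    replica_count: int,
) -> bool:
    # Assumption: one slot per replica, None where nothing was received.
    assert len(messages) == replica_count
    count = 0
    for index, message in enumerate(messages):
        if message is None:
            continue
        # A stale message from a past election round must never be counted.
        assert message.term == term, (
            f"slot {index} holds term {message.term}, expected {term}"
        )
        # A message stored in the wrong slot could be counted twice.
        assert message.replica == index, (
            f"slot {index} holds a message from replica {message.replica}"
        )
        count += 1
    # Simple majority; the exact threshold depends on the protocol.
    return count >= replica_count // 2 + 1
```

With three replicas, `has_quorum([ElectionMessage(3, 0), ElectionMessage(3, 1), None], term=3, replica_count=3)` returns `True`, while the same call with a leftover term-2 message in slot 0 fails the assertion instead of silently counting it.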
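For the deterministic Replica idea in the last bullet above, a very small Python sketch might look like the following. It only shows the shape: a `Timeout` stepped by `tick()` instead of the system clock, an `on_message()` handler that cannot fail, and a simulated bus whose message drops and reordering come from a seeded random number generator. The Journal and State Machine are omitted, the "protocol" is just a periodic ping, and all names and parameters (`SimulatedBus`, `simulate`, the 5-tick timeout, the 20% drop rate) are assumptions for illustration.

```python
import random


class Timeout:
    """Counts ticks instead of reading the system clock."""

    def __init__(self, after: int) -> None:
        self.after = after
        self.ticks = 0

    def tick(self) -> None:
        self.ticks += 1

    def fired(self) -> bool:
        return self.ticks >= self.after

    def reset(self) -> None:
        self.ticks = 0


class SimulatedBus:
    """In-process message bus with seeded drops and reordering."""

    def __init__(self, seed: int, replica_count: int, drop_probability: float) -> None:
        self.random = random.Random(seed)
        self.replica_count = replica_count
        self.drop_probability = drop_probability
        self.queue: list[tuple[int, int, str]] = []

    def broadcast(self, source: int, message: str) -> None:
        for target in range(self.replica_count):
            if target == source:
                continue
            if self.random.random() < self.drop_probability:
                continue  # simulated dropped message
            self.queue.append((target, source, message))

    def deliver(self, replicas: list["Replica"]) -> None:
        pending, self.queue = self.queue, []
        self.random.shuffle(pending)  # simulated reordering
        for target, source, message in pending:
            replicas[target].on_message(source, message)


class Replica:
    def __init__(self, index: int, bus: SimulatedBus) -> None:
        self.index = index
        self.bus = bus
        self.ping_timeout = Timeout(after=5)
        self.pings_seen = 0

    def tick(self) -> None:
        self.ping_timeout.tick()
        if self.ping_timeout.fired():
            self.ping_timeout.reset()
            self.bus.broadcast(self.index, "ping")

    def on_message(self, source: int, message: str) -> None:
        # Changes state and/or sends messages; returns nothing, never fails.
        self.pings_seen += 1


def simulate(seed: int, ticks: int = 100) -> list[int]:
    bus = SimulatedBus(seed, replica_count=3, drop_probability=0.2)
    replicas = [Replica(index, bus) for index in range(3)]
    for _ in range(ticks):
        for replica in replicas:
            replica.tick()
        bus.deliver(replicas)
    return [replica.pings_seen for replica in replicas]
```

Because the only source of randomness is `random.Random(seed)`, calling `simulate` twice with the same seed replays exactly the same drops and delivery order, which is what makes a failing fuzz run reproducible from its seed. A real test would check protocol invariants inside the loop rather than just counting pings.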
VR is -not- a variation of Paxos, much less of the later Multi-Paxos.
Viewstamped Replication was developed independently of Paxos and is distinct from it. (It also came out about a year before Paxos.)
From the author of this OP:
https://brooker.co.za/blog/2014/05/19/vr.html
"Introduced in May 1988 in Brian Oki's PhD thesis, Viewstamped Replication predates the first publication of Paxos by about a year. If you're looking for intrigue you may be disappointed: both Lamport and Liskov claim the inventions were independent."