| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonz05 1404 days ago

That depends. Replicated state machines need to distinguish between a crash and corruption. Most systems don't do that. It can be disastrous to truncate the journal when encountering a checksum mismatch for instance.

See "Protocol-Aware Recovery for Consensus-Based Storage" for more research on that topic. [1][2]

- [1] https://www.usenix.org/system/files/conference/fast18/fast18... - [2] https://www.youtube.com/watch?v=fDY6Wi0GcPs

2 comments

_vvhw 1404 days ago

Indeed!

I was so glad to see you cite not only the Rebello paper but also Protocol-Aware Recovery for Consensus-Based Storage. When I read your first comment, I was about to reply to mention PAR, and then saw you had saved me the trouble!

UW-Madison are truly the vanguard where consensus hits the disk.

We implemented Protocol-Aware Recovery for TigerBeetle [1], and I did a talk recently at the Recurse Center diving into PAR, talking about the intersection of global consensus protocol and local storage engine. It's called Let's Remix Distributed Database Design! [2] and owes the big ideas to UW-Madison.

[1] https://github.com/coilhq/tigerbeetle

[2] https://www.youtube.com/watch?v=rNmZZLant9o

link

AdamProut 1404 days ago

An fsync() failing doesn't necessarily mean there is a disk corruption. I agree all logging and recovery protocols should have different handling for a corruption vs a torn tail of the log for example, but I view that as mostly orthogonal.

I'm talking about exiting the process if fsync() fails and letting the distributed databases normal failover processing do its thing. This is a normal scenario for a failover (i.e, its the same as process crashing or getting OOM killed by linux, etc).

link

ryanworl 1404 days ago

You're probably aware of this, but for the sake of others reading:

Crashing after an fsync failure isn't sufficient if you're using buffered IO either. Dirty pages in the page cache could cause your consensus implementation to e.g. allow voting for two different leaders for the same term if at some future point that machine crashes and the dirty page contained the log record for that vote. Your process would restart after the machine restarts and no longer have access to the dirty page and potentially vote again if asked.

Edit: Thought I'd add the series of steps for this to happen to make it clearer:

1. Process X receives an RPC from Process Y to vote for Y as leader in term 1.

2. Process X writes log record containing vote information to page cache and calls fsync.

3. Process X receives EIO in response to fsync (if it is lucky...) and crashes.

4. Process X restarts and receives a retry of the same RPC from Process Y for term 1 and this time it responds affirmatively because it can see it already voted in term 1 for Process Y, which _should_ be safe.

5. Process X crashes because the machine hosting X experiences a temporary hardware failure.

6. Process X restarts after the hardware failure and receives an RPC from Process Z and responds affirmatively because the dirty page that was available back at T4 never actually made it to disk.

Consensus is now (potentially) broken from electing two leaders for the same term.

link

AdamProut 1404 days ago

You could replace step 3) with "pull power plug from host" for the same effect.

Think of all the extra disk IO the worlds databases are doing to defend against step 3) :)

link

remram 1404 days ago

In distributed system terms, this is a byzantine failure which most consensus mechanisms are not equipped to handle.

link

_vvhw 1404 days ago

And yet many non-Byzantine consensus protocols are equipped to handle the network fault model, which could be seen as equally Byzantine under this definition.

The problem is really that many formal proofs of consensus have focused only on the network fault model, and neglected the storage fault model.

Both network/storage fault models require practical engineering efforts to get right. I think a better term for this is “near-Byzantine” fault tolerance. It's what non-Byzantine fault tolerance looks like when implemented correctly in the real world—the GP comment is a great example of how to approach and think about this from an engineering perspective.

I dive into this in detail also here: https://www.youtube.com/watch?v=rNmZZLant9o

link

remram 1403 days ago

"near-Byzantine" is not a very clear term you can reason about. A system is either Byzantine-fault-tolerant, in which case it handles all Bizantine faults, or it is not. A system that is tolerant to some faults (that you may want to call "Byzantine") is not BFT.

You don't call plaintext SMS "tamper-resistent" because it resists to some simple tampering. You don't call your house "FBI resistant" because you managed to convince them once to turn around.

A Byzantine fault is clearly defined as a case where a specific node may be doing anything, including not know it has failed, including malicious behavior. It is important that people know what class of faults their system is designed to resist; for Raft/Paxos, it is NOT Byzantine faults. Those systems are pretty great, but trying to pretend they aim at BFT is dangerous misinformation...

link

_vvhw 1403 days ago

What then would you specify as the clearly defined storage fault model for non-Byzantine protocols such as Paxos/RAFT that rely on stable storage for correctness?

link