by petrosagg
1076 days ago
Thank you for the detailed response!

> However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.

The PAR paper states that "although Crash preserves safety, it suffers from severe unavailability". I assume that when TigerBeetle loads state from RAM into a CPU cache/register, it operates under the NoDetection consistency model, or under the Crash consistency model if ECC RAM automatically resets the CPU on read errors. Yet it doesn't suffer from severe unavailability, so what gives? The answer is probably that ECC RAM is just reliable enough that the NoDetection/Crash models are fine in practice.

I can believe that the off-the-shelf checksum and redundancy options offered by filesystems like ext4 and ZFS, or by systems like RAID, don't hit the required error probabilities, but why does the argument stop there? Couldn't a distributed database generate error-correcting data on every write in the application layer, so that the error probability becomes low enough for NoDetection/Crash to be a non-issue for storage, just as it is for RAM? Is there some other fundamental difference between reading and writing data from RAM versus a disk?
|
The crux of the problem: how do you solve misdirected read/write I/O, where the firmware writes to or reads from the wrong disk sector (but with a valid checksum)?
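To make the misdirected-write case concrete, here is a minimal Python sketch of a simulated disk. Everything in it (the `seal`/`verify` helpers, the dict used as a disk, the 4-byte CRC32 prefix) is a hypothetical illustration and not how TigerBeetle or PAR lay out sectors. It shows that a per-sector checksum still verifies after the data lands in the wrong sector, and that binding the address into the checksum catches the clobbered sector but not the stale data left at the intended one.

```python
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix the payload with a CRC32 computed over the payload only."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def verify(sector: bytes) -> bool:
    """Check a sector sealed with seal()."""
    stored, payload = sector[:4], sector[4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored

def seal_with_address(address: int, payload: bytes) -> bytes:
    """Prefix the payload with a CRC32 over (address + payload)."""
    bound = address.to_bytes(8, "big") + payload
    return zlib.crc32(bound).to_bytes(4, "big") + payload

def verify_with_address(address: int, sector: bytes) -> bool:
    """Check a sector sealed with seal_with_address() at the address we read."""
    stored, payload = sector[:4], sector[4:]
    bound = address.to_bytes(8, "big") + payload
    return zlib.crc32(bound).to_bytes(4, "big") == stored

def misdirected_write(disk: dict, intended: int, actual: int, sector: bytes) -> None:
    """Simulate firmware that writes to `actual` instead of `intended`."""
    disk[actual] = sector

# Case 1: checksum covers the payload only.
disk_plain = {a: seal(b"old-%d" % a) for a in range(4)}
misdirected_write(disk_plain, intended=1, actual=2, sector=seal(b"new-1"))
plain_detects = not verify(disk_plain[2])  # wrong data, yet the checksum is valid

# Case 2: checksum also covers the address the sector was meant for.
disk_bound = {a: seal_with_address(a, b"old-%d" % a) for a in range(4)}
misdirected_write(disk_bound, intended=1, actual=2,
                  sector=seal_with_address(1, b"new-1"))
bound_detects = not verify_with_address(2, disk_bound[2])  # clobbered sector is caught
stale_passes = verify_with_address(1, disk_bound[1])       # stale sector 1 still verifies

print(plain_detects, bound_detects, stale_passes)
```

Even in the second case, the intended sector silently keeps its old, valid contents, so something outside the sector itself has to know what should be stored there.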
PAR shows how both the global consensus protocol and the local storage engine need to be modified to handle such storage faults, with foundational design changes at the protocol level, if a distributed system is to not only preserve correctness, but also optimize for high availability.
Bear in mind that PAR is not only actually correct, but also more efficient than simply dialing up local redundancy, because it lets you recover using the global redundancy that you already have via replication in the consensus protocol.
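As a rough illustration of recovering from global rather than local redundancy, the Python sketch below contrasts truncating a log at the first bad entry with repairing that entry from a peer. It is a toy under stated assumptions, not the PAR algorithm: the `Replica` class, the helper names, and the equal-length in-memory logs are all hypothetical, and it ignores the hard part PAR addresses, namely deciding at the protocol level whether a faulty entry was committed.

```python
import hashlib
from typing import List, Optional

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Replica:
    """Toy replica: a log of (checksum, entry) pairs held in memory."""

    def __init__(self, entries: List[bytes]) -> None:
        self.log = [(checksum(e), e) for e in entries]

    def corrupt(self, index: int) -> None:
        """Simulate bitrot: the entry changes, the stored checksum does not."""
        stored, _ = self.log[index]
        self.log[index] = (stored, b"garbage")

    def read(self, index: int) -> Optional[bytes]:
        """Return the entry, or None if it is missing or fails its checksum."""
        if index >= len(self.log):
            return None
        stored, entry = self.log[index]
        return entry if checksum(entry) == stored else None

def recover_by_truncation(replica: Replica) -> List[bytes]:
    """Treat the first bad entry as the end of the log (drops what follows)."""
    good = []
    for i in range(len(replica.log)):
        entry = replica.read(i)
        if entry is None:
            break
        good.append(entry)
    return good

def recover_from_peers(replica: Replica, peers: List[Replica]) -> List[Optional[bytes]]:
    """Keep the log length and fill bad entries from any peer with a good copy."""
    repaired: List[Optional[bytes]] = []
    for i in range(len(replica.log)):
        entry = replica.read(i)
        if entry is None:
            for peer in peers:
                entry = peer.read(i)
                if entry is not None:
                    break
        repaired.append(entry)
    return repaired

entries = [b"a", b"b", b"c"]
local, peer = Replica(entries), Replica(entries)
local.corrupt(1)  # bitrot in the middle of the log

truncated = recover_by_truncation(local)
repaired = recover_from_peers(local, [peer])
print(truncated, repaired)
```

In this toy run, truncation discards entries `b"b"` and `b"c"` even though only one of them is damaged, while the peer-assisted path restores the full log without any extra local copies.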
The paper is great, but will especially reward a few passes of reading. The examples they give take time, but are great to work through slowly to gain a deeper understanding.
And/or, you can read the Zig code of PAR in TB! :)
Here's a great place to start, one of our favorite pieces of code in TigerBeetle: https://github.com/tigerbeetle/tigerbeetle/blob/4aca8a22627b...