by petrosagg
1075 days ago
> Sure, we’re not yet injecting storage faults, but then formal proofs for protocols like Raft and Paxos assume that disks are perfect, and depend on this for correctness? After all, you can always run your database over RAID, right? Right?
> If your distributed database was designed before 2018, you probably couldn’t have done much. The research didn’t exist.

I'm trying to understand this part but something seems off. It seems to imply that those proofs do not apply to the real world because disks are not perfect, but neither is RAM. The hardware for both RAM and disks has an inherent error rate which can be brought down to arbitrary levels by using error correction codes (e.g. ECC RAM).

I'm assuming that the TigerBeetle correctness proofs are predicated on perfect RAM even though in reality there is a small probability of errors. This tells me that there is an error rate which they consider negligible. If that's the case, what is the difference between:

* TigerBeetle's storage approach
* Paxos or Raft with enough error correction on disk writes that the probability of errors equals that of ECC RAM, which is considered negligible

I've probably made a logical error in my reasoning but I can't see it. Can someone enlighten me?
Especially spinning platter disks, not sure about SSDs.
The difference in failure rates could be orders of magnitude ... Memory will have random bit flips, but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event).
But disks will have non-random manufacturing issues. I'd be interested in more info too, but my impression is that the data on these issues is pretty thin. FoundationDB mentioned it ~10 years ago and Google published data more than 10 years ago, but hardware has changed a lot since then.
Software redundancy will take care of non-correlated failures, but it fails precisely when there are correlated ones.
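To put rough numbers on the correlated vs. non-correlated point, here is a minimal sketch. It is a toy model and not anything from TigerBeetle or the papers: the function name, the per-replica failure probability `p`, and the common-cause fraction `beta` are all made-up assumptions for illustration.

```python
def p_all_replicas_fail(p: float, n: int, beta: float = 0.0) -> float:
    """Probability that all n replicas lose the same piece of data.

    Toy model with hypothetical parameters:
      p    - probability that a single replica fails
      beta - assumed fraction of those failures that share a common
             cause (e.g. a bad manufacturing batch) and hit every
             replica at once
    """
    common = beta * p                    # one shared event takes out all replicas
    independent = ((1.0 - beta) * p) ** n  # every replica fails on its own
    # Data is lost if either the shared event or the all-independent case occurs.
    return common + independent - common * independent


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-replica failure probability
    for beta in (0.0, 0.01, 0.1):
        print(f"beta={beta}: {p_all_replicas_fail(p, 3, beta):.3e}")
```

With these made-up numbers, three fully independent replicas fail together with probability around 1e-12, but if even 1% of failures share a cause the figure is around 1e-6, and adding more replicas barely changes it. That is the sense in which redundancy stops helping once failures are correlated.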