Hacker News
by gamegoblin 3637 days ago
Exactly. They say they are using some variant of Reed-Solomon erasure coding. If you did something like K=100 and N=150 (any 100 of the 150 shards are enough to reconstruct the data) and stored all of the shards on different disks, the probability you lose data is equal to the probability that more than 50 of those hard drives fail before you can repair the lost shards.

If I am reading the article correctly, they claim that they should usually be able to repair in less than an hour in the case of disk failure.

Thus the probability of losing more than 50 (or whatever their N-K value is) disks within an hour is how you get 27 nines of durability.
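
To make the arithmetic concrete, here is a rough Python sketch of that calculation. It assumes disk failures are independent, and the 2% annual failure rate is a number I made up for illustration, not something from the article. K=100, N=150 and the one-hour repair window are the example values from above, not their real parameters.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def window_failure_prob(afr, window_hours):
    """P(one disk fails within the window), derived from an annual failure rate."""
    # Equivalent to 1 - (1 - afr) ** (window_hours / HOURS_PER_YEAR),
    # written with log1p/expm1 so tiny probabilities keep their precision.
    return -math.expm1(window_hours / HOURS_PER_YEAR * math.log1p(-afr))


def loss_probability(n, k, p):
    """P(more than n - k of n independent disks fail), each with probability p."""
    first_fatal = n - k + 1  # losing this many shards leaves fewer than k
    return math.fsum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(first_fatal, n + 1)
    )


def nines(loss_prob):
    """Express a loss probability as a count of durability nines."""
    return -math.log10(loss_prob)


p = window_failure_prob(afr=0.02, window_hours=1.0)  # hypothetical AFR
loss = loss_probability(n=150, k=100, p=p)
print(f"per-disk p = {p:.3e}, P(loss) = {loss:.3e}, nines = {nines(loss):.1f}")
```

The independence assumption is doing all the work here, which is why the resulting number is so extreme.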

Of course, one of your software engineers introducing a durability bug is WAY more likely than those disks experiencing a coordinated failure.

Or, say, a terrorist organization targeting your datacenters. Even if those odds are one in a billion, that's still not even close to 27 nines.
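
A quick sketch of why the rarer-sounding risk dominates, again assuming the events are independent. The 1e-27 and 1e-9 figures are just the illustrative numbers from this comment, and `combined_loss` is a helper name I made up.

```python
import math


def nines(loss_prob):
    """Express a loss probability as a count of durability nines."""
    return -math.log10(loss_prob)


def combined_loss(*probs):
    """P(at least one of several independent loss events occurs)."""
    # Sum log(1 - p) over all events, then convert back to a loss probability.
    log_survival = sum(math.log1p(-p) for p in probs)
    return -math.expm1(log_survival)


disk_failure = 1e-27  # the "27 nines" figure
black_swan = 1e-9     # the one-in-a-billion event

total = combined_loss(disk_failure, black_swan)
print(f"combined durability: {nines(total):.2f} nines")
```

The overall figure is set almost entirely by the largest risk, so the disk math barely registers.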

1 comments

For sure. I hope we're all agreeing here :)

We're very strong believers that an effective replication strategy is just table stakes and that from there the real risks to durability are the "black swan" events that are much harder to model.

I gave a talk at Data@Scale recently whose main premise is "Durability Theater" and how to combat it in a production storage system. In case you're interested: https://code.facebook.com/posts/253562281667886/data-scale-j...