by jasonzemos 1525 days ago
There is no reason to rely on "triple replication" for data integrity. This has long been a solved problem. An appropriate erasure encoding can reduce the probability of loss roughly ten-fold while consuming physically less space (i.e. 2x worth of replication). Companies forgo this technology because they feel confident in their operational ability to address failures quickly and competently. That's what we're relying on for data integrity, not the math.
2 comments

This email doesn't detail the integrity scheme, only that data was protected if no more than two disks failed, and more than two disks failed.

That could be three copies at 3x the storage cost, or it could be a RAID6 or raidz2 style system where the storage cost is essentially two disks out of the storage set size (which we don't know).
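
To put rough numbers on that comparison, here is a minimal sketch of the raw-to-usable ratio for both layouts. The 12-disk set size is purely an assumption for illustration, since the real set size isn't known, and `storage_overhead` is a made-up helper name.

```python
def storage_overhead(data_disks: int, redundancy_disks: int) -> float:
    """Raw capacity consumed per unit of usable data."""
    return (data_disks + redundancy_disks) / data_disks

# Three full copies: one unit of data plus two extra copies of it.
three_copies = storage_overhead(1, 2)

# RAID6 / raidz2 style: two parity disks in a hypothetical 12-disk set.
raid6_12 = storage_overhead(10, 2)

print(three_copies, raid6_12)
```

Both layouts survive any two disk failures, but under this assumed set size the parity layout needs far less raw space per usable byte.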

You could certainly increase the required failure count by increasing the parity, but if the problem was something like all the disks in the storage system hitting the same fatal disk firmware bug at the same time (as speculated elsewhere in the thread, and unfortunately plausible), then that doesn't really help much.
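
A back-of-the-envelope sketch of why extra parity helps only when failures are independent. It uses a plain binomial model; the per-disk failure probability `P` and the 10-data-disk layout are assumptions picked for illustration, not anything known about the actual system.

```python
from math import comb

def loss_probability(n_disks: int, tolerated: int, p_fail: float) -> float:
    """P(more than `tolerated` of `n_disks` fail), assuming independent failures."""
    return sum(
        comb(n_disks, k) * p_fail**k * (1 - p_fail) ** (n_disks - k)
        for k in range(tolerated + 1, n_disks + 1)
    )

P = 0.01  # hypothetical chance that a given disk fails within one rebuild window

# Independent failures: each extra parity disk cuts the loss probability sharply.
for parity in (2, 3, 4):
    print(parity, loss_probability(10 + parity, parity, P))

# Crude stand-in for a shared firmware bug: every disk fails together.
print(loss_probability(12, 2, 1.0))
```

The independence assumption is the whole story here. Once every disk shares the same fault, the last line comes out as certain loss no matter how much parity you add.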

Erasure coding with lots of redundant drives is how online.net's C14 storage product works, but it is much more hassle than ordinary replication, since retrieval requires reading data from a lot of different drives to reassemble the shares, and storage requests similarly have to be split into shares and recorded. C14 is a Glacier-like product where you request a retrieval and your data is restored as an S3 object sometime (up to hours) later. That makes it easier, I expect.
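
To show what the share juggling looks like, here is a toy sketch using a single XOR parity share. That is a much simpler scheme than whatever multi-parity code C14 actually uses (which I don't know), and the function names are made up. The point is only that every write fans out to all the shares and every repair has to read all the survivors.

```python
from typing import List, Optional

def split_with_parity(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-sized shares plus one XOR parity share."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    shares = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)  # all zeros
    for share in shares:
        parity = bytes(a ^ b for a, b in zip(parity, share))
    return shares + [parity]

def recover(shares: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing share by XOR-ing all surviving shares."""
    missing = [i for i, share in enumerate(shares) if share is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost share")
    result = list(shares)
    if missing:
        survivors = [share for share in shares if share is not None]
        rebuilt = bytes(len(survivors[0]))
        for share in survivors:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, share))
        result[missing[0]] = rebuilt
    return result

shares = split_with_parity(b"hetzner snapshot", 4)
damaged = list(shares)
damaged[2] = None  # simulate one lost drive
restored = recover(damaged)
print(b"".join(restored[:4]))
```

Tolerating more than one lost share needs a real erasure code (Reed-Solomon style) rather than plain XOR, but the read and write fan-out is the same idea.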

I agree that this would be good as a last-ditch backup for stuff like Hetzner cloud snapshots, but the primary storage for those snapshots probably has to be ordinary RAID.

Fwiw I have a bunch of data in Hetzner Storagebox and have gotten several notices in the past few months that Storagebox would be temporarily offline for maintenance, which I assume meant RAID rebuilds and/or drive swaps. The most recent of these was an "urgent" maintenance with less notice time than usual. It hasn't caused me any inconvenience, but I wonder if they have suffered a spate of drive failures recently.