| HN Mirror


by kbenson 2131 days ago

It's a function of how quickly they can replace drives and how well redundant data is spread between disparate systems. IIRC, they make sure data is dispersed not only at the chunk and drive level, but at the system and rack level too (and maybe datacenter level? not sure).

At that point, if there's no contingency redundancy built in (see below), it's really a matter of how long it takes to replace a drive: identifying the problem, physically replacing the hardware, and replicating data to it. There's a lot of (fairly simple) math involved in running down those numbers, but based on the percentage of drives that fail in a quarter, I think it would take a spectacular run of bad luck, combined with negligence on their part in keeping redundancy levels up over a longer period, to actually have problems.
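To give a feel for that math, here's a rough back-of-the-envelope sketch. It assumes drive failures are independent and happen at a constant rate, which real drives only approximate. The 0.5% quarterly failure rate and the 24-hour and one-week repair windows are made-up illustration values, not Backblaze's numbers, and the function names are mine.

```python
import math

HOURS_PER_QUARTER = 91.25 * 24  # roughly a quarter of a year


def window_failure_probability(quarterly_rate, window_hours):
    """Chance that one drive fails within window_hours.

    Assumes a constant failure rate derived from the quarterly rate.
    """
    hourly_rate = -math.log1p(-quarterly_rate) / HOURS_PER_QUARTER
    return -math.expm1(-hourly_rate * window_hours)


def chunk_loss_probability(quarterly_rate, window_hours, remaining_copies=2):
    """Chance that every remaining copy of a chunk also fails before the
    lost copy has been re-replicated (independent failures assumed)."""
    return window_failure_probability(quarterly_rate, window_hours) ** remaining_copies


if __name__ == "__main__":
    # Hypothetical inputs: 0.5% of drives fail per quarter.
    for hours in (24, 24 * 7):
        p = chunk_loss_probability(0.005, hours)
        print(f"repair window {hours:>3} h: chunk loss probability ~ {p:.2e}")
```

The thing to notice is that the risk grows roughly with the square of the repair window when two copies remain, so the time to detect and re-replicate matters far more than the raw failure rate.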

> Is there not the danger that the quality drops drastically to the point that one would need an unreasonable number of copies?

I think the very simple way to look at this is that spare capacity and automatic redundancy checking can account for a lot of bad drives. E.g. say a drive has 100 chunks of data, each also copied to other drives and systems (100-200 of them in total) such that any chunk exists in three places. If that drive dies and the system detects that those 100 chunks now only exist in two places, it can immediately locate 100 locations that have capacity to receive a chunk and start replicating data to keep the level of redundancy they need. Even if there was a very large set of bad drives, they would all have to go bad in a very short time frame, short enough that they couldn't be physically swapped out and data couldn't be copied across the network, for it to cause a problem.
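Here's a minimal sketch of that detect-and-re-replicate step, just to make the idea concrete. The data structures (a chunk-to-drives map and a per-drive free-slot count) and the "pick the drive with the most free space" rule are my own assumptions for illustration, not how Backblaze actually implements it.

```python
def handle_drive_failure(failed, placement, free_slots, target_copies=3):
    """Plan re-replication after a drive dies.

    placement:  dict mapping chunk id -> set of drive ids holding a copy
    free_slots: dict mapping drive id -> number of chunks it can still accept
    Returns a list of (chunk, source_drive, destination_drive) copy jobs
    and updates placement and free_slots in place.
    """
    free_slots.pop(failed, None)  # the dead drive can no longer receive data
    plan = []
    for chunk, holders in placement.items():
        if failed not in holders:
            continue
        holders.discard(failed)
        while len(holders) < target_copies:
            if not holders:
                break  # every copy is gone; nothing left to copy from
            candidates = [
                drive for drive, free in free_slots.items()
                if free > 0 and drive not in holders
            ]
            if not candidates:
                break  # no spare capacity anywhere; chunk stays under-replicated
            # Prefer the drive with the most free space (ties broken by id).
            dest = max(candidates, key=lambda d: (free_slots[d], d))
            source = min(holders)  # any surviving copy works as the source
            plan.append((chunk, source, dest))
            holders.add(dest)
            free_slots[dest] -= 1
    return plan


if __name__ == "__main__":
    placement = {"c1": {"d1", "d2", "d3"}, "c2": {"d1", "d4", "d5"}}
    free_slots = {d: 2 for d in ("d1", "d2", "d3", "d4", "d5", "d6")}
    for job in handle_drive_failure("d1", placement, free_slots):
        print(job)
```

A real system would also keep the copies of a chunk on different systems and racks, which this sketch ignores.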

At least that's how a system like this could be developed, and my understanding is that Backblaze's system works like this to some degree.