by guenthert 2131 days ago
> We have control over reliability as our systems are designed to deal with drive failure.

That surely assumes an upper limit of the likelihood of a drive failure. There was a perception that the quality of 3.5" floppy disks declined drastically in the early 21st century. Must we not fear something similar for spinning-rust hard drives once most everyone uses SSDs?


Briefly, any drive (floppy disk or tape drive) has some likelihood of failure. You can minimize loss of data (the reliability being discussed) by replicating data in more than one storage item. It just becomes a matter of how many you buy (and how good you are at keeping them all properly organized).
Today reliability is sufficient that one can meet a given data availability goal by replicating the data 2, 3, 4, 5, <whatever> times as there is only once in a blue moon a bad batch of drives when they tried out a new bearing lubricant or so. But what if the economic incentives decline and the market breaks apart (as it arguably does), much like it happened for floppy disks once they were (perceived as) obsolete and used only in fringe applications (HP logic analyzers come to mind, but also Boing airplanes)? Is there not the danger that the quality drops drastically to the point that one would need an unreasonable number of copies?
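
A rough way to put numbers on that worry, assuming drive failures are independent and every copy is lost with the same probability over the period of interest. Both are simplifications, since a bad batch or a firmware bug breaks independence. The function name and parameters are made up for illustration:

```python
def copies_needed(p_copy_lost: float, target_loss: float) -> int:
    """Smallest number of copies n such that p_copy_lost**n <= target_loss.

    Assumption: copies are lost independently, each with probability
    p_copy_lost, so all n are lost together with probability p_copy_lost**n.
    """
    if not 0.0 <= p_copy_lost < 1.0:
        raise ValueError("p_copy_lost must be in [0, 1)")
    if not 0.0 < target_loss < 1.0:
        raise ValueError("target_loss must be in (0, 1)")
    n = 1
    while p_copy_lost ** n > target_loss:
        n += 1
    return n
```

Under these assumptions the count grows only logarithmically as the target tightens, but it climbs quickly as the per-copy loss probability approaches 1, which is the scenario the question is about.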
> Today reliability is sufficient that one can meet a given data availability goal by replicating the data 2, 3, 4, 5, <whatever> times as there is only once in a blue moon a bad batch of drives when they tried out a new bearing lubricant or so.

Unless you have an uptime bug in the firmware where all your drives die at once:

* https://www.zdnet.com/article/hpe-tells-users-to-patch-ssds-...

It's a factor of how quickly they can replace drives and how well redundant data is spread between disparate systems. IIRC, they make sure data is dispersed not only at the chunk and drive level, but the system and rack level (and maybe datacenter level? not sure).

At that point, if there's no contingency redundancy built in (see below), it's really a matter of how long it takes to replace a drive (identifying the problem, physically replacing the hardware, and replicating data to it). There's a lot of (fairly simple) math involved in running down those numbers, but based on the percentage of drives that fail in a quarter, I think it would take both a spectacular run of bad luck and negligence on their part in keeping redundancy levels up over a longer period to actually have problems.
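
A minimal sketch of that math, assuming a constant failure rate within the quarter and independent drives. The function names, the 91.25-day quarter, and any rates you plug in are illustrative assumptions, not Backblaze figures:

```python
HOURS_PER_QUARTER = 91.25 * 24  # assumption: a quarter is about 91.25 days


def window_failure_prob(quarterly_failure_rate: float, repair_hours: float) -> float:
    """Chance that one given drive fails during the repair window.

    Assumes a constant hazard rate, so the survival probability over the
    window is the quarterly survival probability raised to the fraction
    of a quarter that the window covers.
    """
    fraction_of_quarter = repair_hours / HOURS_PER_QUARTER
    return 1.0 - (1.0 - quarterly_failure_rate) ** fraction_of_quarter


def chunk_loss_prob(quarterly_failure_rate: float, repair_hours: float, copies: int) -> float:
    """Chance that, once one copy is gone, every remaining copy also fails
    before the repair finishes (independent failures assumed)."""
    p = window_failure_prob(quarterly_failure_rate, repair_hours)
    return p ** (copies - 1)
```

The repair time sits in the exponent's base, so shortening the window (faster detection, swap, and re-replication) shrinks the loss probability for every extra copy kept.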

> Is there not the danger that the quality drops drastically to the point that one would need an unreasonable number of copies?

I think the very simple way to look at this is that spare capacity and automatic redundancy checking can account for a lot of bad drives. E.g. say a drive has 100 chunks of data, each also copied to other drives and systems, some 100-200 of them in total (such that there are three copies of any chunk). If that drive dies and the system detects that those 100 chunks now exist in only two places, it can immediately locate 100 locations that have capacity to receive a chunk and start replicating data to restore the level of redundancy they need. Even if there was a very large set of bad drives, they would have to all go bad in a very short time frame, short enough that they couldn't be physically swapped out and data couldn't be copied across the network, for it to cause a problem.

At least that's how a system like this could be developed, and my understanding is that Backblaze's system works like this to some degree.
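
The re-replication idea above can be sketched roughly as follows. This is a toy model, not Backblaze's implementation: the data structures, the `re_replicate` name, and the "most free space first" placement rule are all assumptions for illustration:

```python
def re_replicate(placement, capacity, dead_drive, target_copies=3):
    """Plan copy jobs that restore target_copies for chunks on dead_drive.

    placement: dict chunk_id -> set of drive ids holding a copy (mutated)
    capacity:  dict drive id -> number of free chunk slots (mutated)
    Returns a list of (chunk_id, source_drive, destination_drive) jobs.
    """
    jobs = []
    capacity.pop(dead_drive, None)  # a dead drive cannot receive data
    for chunk, holders in placement.items():
        if dead_drive not in holders:
            continue  # this chunk is still fully replicated
        holders.discard(dead_drive)
        while len(holders) < target_copies:
            if not holders:
                raise RuntimeError(f"chunk {chunk} has no surviving copy")
            # Only drives with free space that do not already hold the chunk.
            candidates = [d for d, free in capacity.items()
                          if free > 0 and d not in holders]
            if not candidates:
                raise RuntimeError(f"no spare capacity for chunk {chunk}")
            dest = max(candidates, key=lambda d: capacity[d])
            source = min(holders)  # any survivor works; min() is deterministic
            jobs.append((chunk, source, dest))
            holders.add(dest)
            capacity[dest] -= 1
    return jobs
```

A real system would also spread the new copies across systems and racks and throttle network traffic, which this sketch ignores.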

> (HP logic analyzers come to mind, but also Boing airplanes)

Typo of the year, "Boing 737MAX" sounds more like a basketball than something I'd want to fly in.