by chubot 1075 days ago
Not an expert in this area, but I think disks have correlated failure modes whereas CPUs and memory generally don't.

Especially spinning platter disks, not sure about SSDs.

The difference in failure rates could be orders of magnitude ... Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)

But disks will have non-random manufacturing issues. I'd be interested in more info too, but my impression is that the data on these issues is pretty thin. FoundationDB mentioned it about 10 years ago and Google published data more than 10 years ago, but hardware has changed a lot since then.

Software redundancy will take care of non-correlated failures, but it fails precisely when the failures are correlated.
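
To put rough numbers on that, here is a toy model in Python. It is only a sketch: the probabilities are made up, and `p_common` is a hypothetical knob for "one shared cause (bad batch, firmware bug) takes out every replica at once", not something measured.

```python
def p_all_replicas_fail(p_independent: float, replicas: int, p_common: float = 0.0) -> float:
    """Probability that every replica fails in the same window (toy model).

    p_independent: assumed per-replica chance of an unrelated failure
    p_common:      assumed chance of a shared-cause event that kills all replicas
    """
    independent_loss = p_independent ** replicas
    # Either the shared event happens, or it does not and every
    # replica has to fail on its own.
    return p_common + (1.0 - p_common) * independent_loss


if __name__ == "__main__":
    for n in (1, 2, 3):
        uncorrelated = p_all_replicas_fail(0.01, n)
        correlated = p_all_replicas_fail(0.01, n, p_common=0.001)
        print(n, uncorrelated, correlated)
```

With independent failures only, each extra replica multiplies the loss probability by `p_independent`. Once `p_common` is nonzero, it puts a floor under the result that adding replicas does not lower.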

3 comments

I work on a large distributed database system. SSDs absolutely have correlated failures. Also CMOS batteries. Also CPU and memory (think a manufacturing defect or a storage climate issue on specific batches that made it through QA). Pretty much nothing is 100% guaranteed to have no correlated failures. It comes down to probabilities. You can add flexibility, variation, vendor/sourcing diversity to reduce risks.
Years ago I went through a rather large batch of OCZ SSDs that all failed within a two-week window. Thankfully the IBM Death Star had long before made me allergic to putting devices of the same model in the same RAID array if I can help it, so it was a nuisance rather than a disaster.
SSDs tend to have highly correlated failure modes because you either run into a firmware bug, which is the same on every SSD of that model, or you put the same wear on every SSD, which locks them all into read-only mode within a short period of time. You might argue that read-only is not a failure, but read-only means downtime and replacing hardware.
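
A minimal sketch of the "don't put the same model in one array" check, in Python. The `Drive` record and its fields are hypothetical; a real inventory would have to supply model and firmware from whatever tooling you already use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    # Illustrative fields only, not tied to any real inventory tool.
    serial: str
    model: str
    firmware: str


def shared_risk_groups(drives, min_size=2):
    """Return {(model, firmware): count} for groups with at least min_size drives."""
    counts = Counter((d.model, d.firmware) for d in drives)
    return {key: n for key, n in counts.items() if n >= min_size}
```

Any group this returns is a set of drives that would share a firmware bug, and probably a wear curve too if they were installed together. An empty result only says the array is diverse by model and firmware, not that it has no correlated failure modes.
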
> Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)

Cosmic rays cause random bit flips, sure.

However, a DIMM going bad corrupts the same areas over and over. If you've run memtest or similar on bad DIMMs, you'll see it tell you exactly which DIMM is bad.

Now, dynamic memory management and virtual memory mapped onto physical memory complicate that picture... but you could easily end up with a single buffer used for e.g. TCP receive that lives in the same physical RAM region for the lifetime of the process.

Similarly, firmware bugs have resulted in very deterministic corruptions.
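
One rough way to tell the two cases apart, sketched in Python: bucket the physical addresses of observed errors into regions and look for regions that keep coming back. The input list is hypothetical (wherever your platform reports error addresses), and the 4 KiB region size and the threshold of 3 hits are arbitrary assumptions.

```python
from collections import Counter


def repeat_offenders(error_addresses, region_bits=12, min_hits=3):
    """Return {region_base_address: hits} for regions with at least min_hits errors.

    region_bits=12 groups addresses into 4 KiB regions (an assumed granularity).
    """
    regions = Counter(addr >> region_bits for addr in error_addresses)
    return {
        region << region_bits: hits
        for region, hits in regions.items()
        if hits >= min_hits
    }
```

Scattered one-off addresses look like random flips and return nothing. The same region showing up repeatedly looks like a bad DIMM, or a deterministic bug that always hits the same buffer.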