| We thought a lot about correlated storage failures - especially with regard to SSDs - as we rebuilt our infrastructure circa 2012/2013. In the end, the low hanging fruit - or, the biggest actionable takeaway - was that when we build boot mirrors out of SSDs, they should not be identical SSDs. This was a hunch I had, personally, and I think experience and, now, results like these, bear it out. Consider: an SSD can fail in a logical way. Not because of physical stress or mechanical wear, which has all kinds of random noise in the results - but due to a particular sequence of usage. If the two SSDs are mirrored, it is possible that they receive identical usage sequences over their lifetime. ... which means they can fail identically - perhaps simultaneously. Nothing fancy or interesting about the solution: all rsync.net storage arrays have boot mirrors that mix either the current generation Intel SSD with the previous generation Intel SSD or mix an Intel SSD with a Samsung SSD. |
But nothing like SSDs.
They absolutely can and do fail nearly simultaneously, and it doesn't even take identical use. I've had multiple SSDs from the same batch fail in the same week despite being in different arrays hosting different data, albeit with similar usage patterns. If you're unlucky and get a bad firmware revision, you may suddenly face a cascade of failing drives before you have time to upgrade (I particularly remember a bad time dealing with failing OCZ SSDs...).
It's terrifying. My home NAS has four different brands for that reason. And of course I never trust a single array.
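For what it's worth, the "don't mirror identical drives" rule is easy to check mechanically. Here is a minimal sketch, assuming you already have an inventory of the drives in each mirror; the `Drive` record, the `shared_failure_domains` helper, and the model/firmware strings are all made up for illustration, not taken from any real tool or product list:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    # Hypothetical inventory record; fill it from whatever your tooling reports.
    serial: str
    vendor: str
    model: str
    firmware: str


def shared_failure_domains(mirror):
    """Return the (vendor, model, firmware) combos that appear on more than
    one drive in the mirror, i.e. members that could hit the same logical or
    firmware bug at the same time."""
    counts = Counter((d.vendor, d.model, d.firmware) for d in mirror)
    return [key for key, n in counts.items() if n > 1]


# Placeholder data: one mirror of identical drives, one mixed-vendor mirror.
identical = [
    Drive("A1", "Intel", "ModelX", "1.0"),
    Drive("A2", "Intel", "ModelX", "1.0"),
]
mixed = [
    Drive("B1", "Intel", "ModelX", "1.0"),
    Drive("B2", "Samsung", "ModelY", "2.3"),
]

print(shared_failure_domains(identical))  # non-empty: both members match
print(shared_failure_domains(mixed))      # empty: no shared combo
```

This only catches the obvious case of matching model and firmware. It says nothing about drives from the same manufacturing batch, which is what bit me.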
Dealing with storage has done more than anything else to make me worry about hardware risks... I really don't envy you running a storage service...
EDIT: IBM DeathStar refers to this, btw: https://en.m.wikipedia.org/wiki/Deskstar - see particularly the images. It was grim.