HN Mirror


by rsync 1919 days ago

We thought a lot about correlated storage failures - especially with regard to SSDs - as we rebuilt our infrastructure circa 2012/2013.

In the end, the low-hanging fruit - or, the biggest actionable takeaway - was that when we build boot mirrors out of SSDs, they should not be identical SSDs.

This was a hunch I had, personally, and I think experience and, now, results like these, bear it out.

Consider: an SSD can fail in a logical way. Not because of physical stress or mechanical wear, which has all kinds of random noise in the results - but due to a particular sequence of usage. If the two SSDs are mirrored, it is possible that they receive identical usage sequences over their lifetime.

... which means they can fail identically - perhaps simultaneously.

Nothing fancy or interesting about the solution: all rsync.net storage arrays have boot mirrors that either mix the current-generation Intel SSD with the previous-generation Intel SSD or mix an Intel SSD with a Samsung SSD.

3 comments

vidarh 1919 days ago

I've seen highly correlated failures on regular hard-drives too. We had a large array of IBM DeathStars that failed approximately one every couple of weeks until the entire array had been replaced, for example.

But nothing like SSDs.

They absolutely can and do fail near-simultaneously, and it doesn't even take identical use. I've had multiple SSDs from the same batch fail the same week despite being in different arrays hosting different data, albeit with similar usage patterns. If you're unlucky and get a bad firmware revision, you may suddenly face a cascade of failing drives before you have time to upgrade (I particularly remember a bad time dealing with failing OCZ SSDs...)

It's terrifying. My home NAS has four different brands for that reason. And of course I never trust a single array.

Dealing with storage has done more than anything else to make me worry about hardware risks... I really don't envy you running a storage service...

EDIT: IBM DeathStar refers to this, btw: https://en.m.wikipedia.org/wiki/Deskstar - see particularly the images. It was grim.


renox 1918 days ago

> I've seen highly correlated failures on regular hard-drives too.

Yes, very mysterious too, until you discover that someone made a big hole in the wall of the room containing the HDD storage bay and that the HDDs are covered in dust!

It was a looong time ago but I think I'll never forget opening the door and looking at the mess..


bluetwo 1919 days ago

So the reason you don't marry your cousin is the same reason you don't back up your primary data to a second drive from the same batch: it amplifies the defects.


waterhouse 1919 days ago

This is a good thing to do.

For higher-hanging fruit, if you don't have enough different models of drives to make them all unique, then you might still try to protect against a run of manufacturing defects. Suppose there was a slightly defective machine making a series of drives with a certain problem. If you do things like buy drives in different groups from different middlemen or at different times, and either take one from each group or put them into a big pool and grab them at random, then that decreases the likelihood of having multiple drives from a single defective run end up in the same array.
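A rough sketch of the "take one from each group" idea, in Python. Everything here is an assumption made for illustration: the function name `assign_arrays`, the representation of a drive as a `(serial, batch)` tuple, and the rule that an array is only built when enough distinct batches remain. It is not anyone's actual provisioning tooling.

```python
import random
from collections import defaultdict


def assign_arrays(drives, array_size, seed=None):
    """Group drives into arrays so no array holds two drives from one batch.

    drives: list of (serial, batch) tuples (hypothetical representation).
    Returns a list of arrays, each a list of (serial, batch) tuples.
    Drives that cannot be placed without repeating a batch are left out.
    """
    rng = random.Random(seed)

    # Bucket serial numbers by purchase batch, then shuffle within each
    # bucket so the pick inside a batch is random.
    by_batch = defaultdict(list)
    for serial, batch in drives:
        by_batch[batch].append(serial)
    for serials in by_batch.values():
        rng.shuffle(serials)

    arrays = []
    while sum(len(s) for s in by_batch.values()) >= array_size:
        # Prefer the batches with the most drives left, so the large
        # batches are drained evenly instead of piling up at the end.
        fullest = sorted(by_batch, key=lambda b: len(by_batch[b]), reverse=True)
        chosen = [b for b in fullest if by_batch[b]][:array_size]
        if len(chosen) < array_size:
            break  # not enough distinct batches left for another array
        arrays.append([(by_batch[b].pop(), b) for b in chosen])
    return arrays
```

With four batches of three drives each and `array_size=4`, this yields three arrays that each contain exactly one drive per batch. The "big pool, grab at random" variant is simpler (shuffle the whole list and slice it), but it only lowers the odds of same-batch drives sharing an array rather than ruling it out.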
