Hacker News
by qyv 3444 days ago
Perhaps this is old information, but don't SSDs from the same batch/model tend to fail all within the same timeframe? A steady stream of failures is probably preferable to a single mass failure.
3 comments

Always mix batches, almost regardless of what you buy. HDDs can and do fail like that too. I learned that the almost-hard way back in the days of the IBM Death Star [1], where we had a RAID that started failing, one drive at a time, roughly a week apart. We had reasonably up-to-date backups but couldn't afford to take everything offline, so we were biting our nails through weeks of continuous RAID rebuilds and reduced performance. Thankfully everything stayed up, and we didn't get any additional failures during any of the rebuilds.

[1] https://en.wikipedia.org/wiki/Deskstar

Finding different batches can be difficult, especially if, like most companies, you buy your kit from one or two approved vendors.

You often hear the trope 'don't mix different disk brands in RAID'. Does anyone know if that's true?

They can. I had a set of Crucial SSDs which all contained the same firmware defect which took them offline after X hours of power-on time.

I also had a RAID 1 array where both SSDs failed within a couple days of each other (due to wear). That was a rude surprise. They were only six months old.

I used to write SSD firmware (not Crucial though!) and new code can always be buggy despite our best efforts to test it thoroughly. However, many of the SSD companies have carried their firmware through multiple generations now, and the code has matured during that time, so I expect this to be less of an issue. The bigger issue now will be a process shrink resulting in a NAND issue that was not identified in time to be properly mitigated by the controller/firmware, so I personally wait a year for a product to mature before buying it.
Ah yes, the 5200 hours bug. That was fun to read about only a few days after plugging in my Crucial M4.
I went through 3 different Crucial M4s, all replaced under warranty due to unrecoverable read errors and subsequent data loss.

That model really had some serious issues at the time.

The benefit of SSDs is that you can almost exactly predict when an SSD will fail.

You know the read and write IOPS, so you know when they will fail.
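
As a rough sketch of that kind of estimate, assuming wear is the only failure mode (the firmware and controller failures reported elsewhere in this thread show it is not): divide the remaining rated write endurance by the average daily write volume. The function name and all the numbers below are hypothetical, not taken from any vendor's specifications.

```python
def estimated_days_until_wearout(rated_tbw, written_tb, daily_write_gb):
    """Estimate days until the rated write endurance is exhausted.

    rated_tbw:      endurance rating in terabytes written (hypothetical figure)
    written_tb:     terabytes already written to the drive
    daily_write_gb: average gigabytes written per day
    """
    if daily_write_gb <= 0:
        raise ValueError("daily_write_gb must be positive")
    # A drive already past its rating has zero estimated days left.
    remaining_tb = max(rated_tbw - written_tb, 0.0)
    return remaining_tb * 1000.0 / daily_write_gb


# Hypothetical drive: 150 TBW rating, 30 TB written so far, 40 GB/day.
print(estimated_days_until_wearout(150, 30, 40))  # 3000.0
```

This only covers wear from writes; it says nothing about sudden controller or firmware failures.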

I don't know about that. Every SSD that has failed for me did so quite unexpectedly and prematurely. I haven't yet taken an SSD to its write limit, and none of them have survived more than a few years except the very first Intel 80GB gen 1.
I had my first Intel 32/40GB fail just outside warranty (originally 1 year, iirc). But I was so addicted to the speed difference I stuck with them... The price difference isn't bad for laptop/desktop use now. But man, I can't even consider it for my NAS.
I've never had any storage medium fail on me, and I feel like I do a lot of read/writes.

Exception is a flash drive I snapped in half once. Oops.

Never had a failure on any of my personal laptops or PCs. I see it all the time though at work on our servers. I think what an individual considers a lot of read/writes is a drop in the bucket compared to what servers do.
Unless it's the controller or firmware that goes bad... I once had three SSDs from the same batch (in different RAID sets, thankfully - we mix and match) go within a week, and all had completely garbled SMART data.

But overall I prefer SSDs - just mix and match different models/manufacturers/batches in different RAID sets as for HDDs.
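
One possible way to do the mixing, as a minimal sketch: sort the drives by batch and deal them out round-robin across the RAID sets, so drives from the same batch land in different sets. The function name, serials and batch labels are made up for illustration, and the approach only keeps batches apart when no batch has more drives than there are sets.

```python
def spread_batches(drives, num_sets):
    """Distribute drives across RAID sets so batches are mixed.

    drives:   list of (serial, batch) tuples (hypothetical inventory format)
    num_sets: number of RAID sets to fill
    """
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    # Sorting by batch puts same-batch drives next to each other,
    # so round-robin dealing sends them to different sets.
    ordered = sorted(drives, key=lambda drive: drive[1])
    sets = [[] for _ in range(num_sets)]
    for i, drive in enumerate(ordered):
        sets[i % num_sets].append(drive)
    return sets


inventory = [("A1", "batchA"), ("A2", "batchA"),
             ("B1", "batchB"), ("B2", "batchB")]
for raid_set in spread_batches(inventory, 2):
    print(raid_set)
```

With this inventory each of the two sets ends up with one drive from batchA and one from batchB.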