by CrLf 4911 days ago
One thing I learnt early in my career as a sysadmin is that disk quality is very important, and so is the quality of the RAID controller or software RAID subsystem. After you have a multiple-drive failure on a supposedly safe RAID-1, and are forced into stitching it back into operation with a combination of "badblocks" and "dd", you'll quickly understand why...

A good RAID controller won't let a drive with bad sectors continue to operate silently in an array. Once an unreadable sector is detected, the drive is failed immediately, period.

The hard part is detection, which is why good RAID controllers "scrub" the whole array periodically. If yours doesn't, or if you are paranoid like me, the same can be accomplished by having "smartd" initiate a long "SMART" self-test on the drives every week.
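To make the idea of a scrub concrete, here is a minimal sketch in Python. It is not how any particular controller does it: the drive model (a list of sectors, with None standing in for an unreadable sector) and the name scrub_mirror are assumptions made up for illustration. The point is only that every sector on every member gets read, so a bad sector is found before a rebuild needs it.

```python
from typing import List, Optional, Tuple

# Hypothetical model: a drive is a list of sectors; None marks an unreadable sector.
Drive = List[Optional[bytes]]


def scrub_mirror(drives: List[Drive]) -> List[Tuple[int, Optional[int], str]]:
    """Read every sector of every mirror member and report problems.

    Returns (sector_index, drive_index, reason) tuples. drive_index is None
    for a mismatch, because the scrub alone cannot tell which copy is wrong.
    """
    problems = []
    for sector in range(len(drives[0])):
        copies = [drive[sector] for drive in drives]
        # An unreadable sector on any member is reported against that member.
        for drive_index, data in enumerate(copies):
            if data is None:
                problems.append((sector, drive_index, "unreadable"))
        # Readable copies that disagree mean the mirror is out of sync.
        if len({c for c in copies if c is not None}) > 1:
            problems.append((sector, None, "mismatch"))
    return problems


healthy = [b"a", b"b", b"c"]
failing = [b"a", None, b"c"]
print(scrub_mirror([healthy, failing]))  # [(1, 1, 'unreadable')]
```

What happens with each reported problem (fail the drive, rewrite the sector from the good copy, just log it) is a policy decision that this sketch deliberately leaves out.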

Good controllers will even fail drives when a transient error happens, one which triggers a bad block reallocation by the drive, for example. This is what leads some people to "fix" failed drives by taking them out and putting them back in. After a rebuild the drive will operate normally without any errors, but you are putting yourself at serious risk of it failing during a later rebuild if another drive happens to fail, so DON'T do this.

Some others will react differently to these transient errors. EMC arrays, for instance, will copy a suspicious drive to the hot-spare and call home for a replacement. This is not only much faster than a full rebuild but also much safer, because it doesn't increase the risk of a second drive failing while doing it.
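A rough sketch of that proactive-copy idea, again in Python on a simulated mirror. This is an illustration under assumptions, not EMC's actual implementation: the function name proactive_copy and the list-of-sectors drive model are hypothetical. The suspicious drive is read directly, and the healthy mirror is only touched for the sectors the suspect can no longer deliver.

```python
from typing import List, Optional

# Hypothetical model: a drive is a list of sectors; None marks an unreadable sector.
Drive = List[Optional[bytes]]


def proactive_copy(suspect: Drive, mirror: Drive) -> Drive:
    """Build the hot-spare contents from a suspicious drive.

    Most sectors come straight from the suspect, so the healthy mirror sees
    very little extra load and stays available as a fallback throughout.
    """
    spare: Drive = []
    for sector, data in enumerate(suspect):
        if data is None:
            # Fall back to the mirror only for sectors the suspect cannot read.
            data = mirror[sector]
        if data is None:
            raise IOError(f"sector {sector} unreadable on both drives")
        spare.append(data)
    return spare


suspect = [b"a", None, b"c"]
mirror = [b"a", b"b", b"c"]
print(proactive_copy(suspect, mirror))  # [b'a', b'b', b'c']
```

Compare this with a full rebuild, where every sector has to be read from the surviving drive and a single unreadable sector there has no fallback.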

Oh, and did I mention that cheap drives lie?

Avoid using desktop drives on production servers for important data, even in a RAID configuration, unless you have some kind of replicated storage layer above your RAID layer (meaning you can afford to recover one node from backup, for speed, and then resync it with the master to make it current).

1 comments

Your advice is ok for someone who is willing to take no risks and to spend the money on that. It is not strictly correct for all situations. In fact, storage arrays are not likely to drop a disk on the first medium error, since medium errors are a fact of life and do not necessarily indicate a bad disk. Of course, a medium error does warrant longer-term monitoring to make sure such errors are not recurring too often on a specific drive. That would be a cause for concern, but a single medium error is of no real significance.

I also found that higher-end drives lie too. I have used SAS Nearline drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.

Maybe I was wrong in using the term "transient error"...

A bad block reallocation can be seen as a transient error from the controller's perspective, but it isn't silent provided the drive doesn't lie about it (and one would expect that a particular storage system vendor doesn't choose - and brand - drives that lie to their own controllers).

The storage system may ignore medium errors that force a repeated read (below a certain threshold), but it shouldn't ignore a medium error where the bad sector reallocation count increases afterwards (which is just another medium error threshold being hit, this time by the drive itself).
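The policy described above can be sketched as a small decision function. This is a hedged illustration only: the DriveStats fields, the should_flag name and the default threshold of 10 are all assumptions chosen for the example, not values taken from any real controller or from SMART itself.

```python
from dataclasses import dataclass


@dataclass
class DriveStats:
    retried_reads: int        # medium errors that were recovered by re-reading
    reallocated_sectors: int  # reallocation count as reported by the drive


def should_flag(previous: DriveStats, current: DriveStats,
                retry_threshold: int = 10) -> bool:
    """Decide whether a drive deserves attention (hypothetical policy).

    Retried reads are tolerated while they stay below retry_threshold, but
    any increase in the reallocation count flags the drive immediately.
    """
    if current.reallocated_sectors > previous.reallocated_sectors:
        return True
    return current.retried_reads >= retry_threshold


before = DriveStats(retried_reads=2, reallocated_sectors=0)
print(should_flag(before, DriveStats(retried_reads=3, reallocated_sectors=0)))  # False
print(should_flag(before, DriveStats(retried_reads=3, reallocated_sectors=1)))  # True
```

Note that this only works if the drive reports its reallocations honestly, which is exactly the caveat about drives that lie.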

I'm not saying whether higher-end drives are more reliable or not. Given that most errors on standard SATA drives go undetected for longer, one could even argue that higher-end drives seem to fail much more frequently... I've had more FC drives replaced in a single EMC storage array than in the rest of the servers (which have a mix of internal 2.5in SAS and older 3.5in SCSI320 drives), and we certainly replace more drives in servers than in desktops.

But that's another topic entirely.