| One thing I learned early in my career as a sysadmin is that disk quality is very important, and so is the quality of the RAID controller or software RAID subsystem. After you have a multiple-drive failure on a supposedly safe RAID-1 and are forced to stitch it back into operation with a combination of "badblocks" and "dd", you'll quickly understand why... A good RAID controller won't let a drive with bad sectors continue to operate silently in an array. Once an unreadable sector is detected, the drive is failed immediately, period. The problem is detection: a bad sector goes unnoticed until something tries to read it, which is why good RAID controllers "scrub" the whole array periodically. If yours doesn't, or if you are paranoid like me, the same can be accomplished by having "smartd" initiate a long SMART self-test on the drives every week. Good controllers will even fail a drive on a transient error, for example one that triggers a bad block reallocation by the drive. This is why some people "fix" failed drives by taking them out and putting them back in. After the rebuild the drive will operate normally without any errors, but you run a serious risk of it failing again during a later rebuild, when another drive has failed, so DON'T do this. Other systems react differently to these transient errors. EMC arrays, for instance, will copy a suspicious drive to the hot spare and call home for a replacement. This is much faster than a full rebuild, and also much safer, because it doesn't increase the risk of a second drive failing while it runs. Oh, and did I mention that cheap drives lie? Avoid using desktop drives on production servers for important data, even in a RAID configuration, unless you have some kind of replicated storage layer above your RAID layer (meaning you can afford to restore a node from backup for speed and then resync it with the master to make it current). |
I also found that higher-end drives lie. I have used nearline SAS drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.
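
For anyone who wants to try the weekly long SMART self-test mentioned in the first comment, a minimal smartd.conf sketch could look like the lines below. The device names (/dev/sda, /dev/sdb), the Sunday 03:00 and 04:00 slots, and mailing root are assumptions for illustration, not something taken from the comments above; adjust them to your own drives and maintenance window.

```
# /etc/smartd.conf (sketch; device names and schedule are assumptions)
# -a            enable the default set of SMART checks
# -s L/../../7/03  run a Long self-test, any month, any day of month,
#               on day-of-week 7 (Sunday), at hour 03
# -m root       mail warnings to root
/dev/sda -a -s L/../../7/03 -m root
/dev/sdb -a -s L/../../7/04 -m root
```

Staggering the start hour per drive keeps both halves of a mirror from running a long test at the same moment. The result of the last test can be read back with "smartctl -l selftest /dev/sda".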