by baruch 4918 days ago
Your advice is fine for someone who wants to take no risks and is willing to spend the money on that, but it is not strictly correct for all situations. In fact, storage arrays are not likely to drop a disk on the first medium error, since medium errors are a fact of life and do not necessarily indicate a bad disk. Of course, a medium error does warrant longer-term monitoring to make sure the errors are not recurring consistently or too often on a specific drive. That would be a cause for concern, but a single medium error is of no real significance.
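To make the "long-term inspection" idea concrete, here is a rough sketch of per-drive tracking over a sliding time window. The class name, the 30-day window and the limit of 3 errors are all made up for illustration; real arrays use their own vendor-specific thresholds.

```python
from collections import defaultdict, deque


class MediumErrorTracker:
    """Flags a drive only when medium errors recur within a time window."""

    def __init__(self, window_seconds=30 * 24 * 3600, max_errors=3):
        # Both defaults are hypothetical, not values from any vendor.
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._events = defaultdict(deque)  # drive_id -> error timestamps

    def record(self, drive_id, timestamp):
        """Record one medium error; return True if the drive now looks suspect."""
        events = self._events[drive_id]
        events.append(timestamp)
        # Drop errors that have fallen out of the observation window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        # A single error (or a few) is tolerated; a sustained run is not.
        return len(events) > self.max_errors
```

With this, one isolated error never flags a drive; only repeated errors on the same drive inside the window do.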

I also found that higher-end drives lie: I have used nearline SAS drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.

1 comment

Maybe I was wrong in using the term "transient error"...

A bad block reallocation can be seen as a transient error from the controller's perspective, but it isn't silent provided the drive doesn't lie about it (and one would expect that a particular storage system vendor doesn't choose - and brand - drives that lie to their own controllers).

The storage system may ignore medium errors that force a repeated read (below a certain threshold), but it shouldn't ignore a medium error after which the drive's bad sector reallocation count increases (which is just another medium error threshold being hit, this time by the drive itself).
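As a rough sketch of the policy above (not how any particular array implements it), the decision could look like the following. The function name, its parameters and the default threshold of 5 retried reads are assumptions for illustration; the reallocation counts would come from whatever the drive reports, for example the SMART reallocated sector count.

```python
def should_flag_drive(retried_reads, realloc_before, realloc_after,
                      retry_threshold=5):
    """Decide whether a drive deserves attention after medium errors.

    retried_reads: medium errors that forced a repeated read (hypothetical counter).
    realloc_before / realloc_after: the drive's reported reallocation count
    before and after the error.
    retry_threshold: made-up limit below which retried reads are ignored.
    """
    # A growing reallocation count means the drive hit its own threshold,
    # so this is never ignored.
    if realloc_after > realloc_before:
        return True
    # Otherwise, retried reads are tolerated until they reach the threshold.
    return retried_reads >= retry_threshold
```

So `should_flag_drive(2, 10, 10)` ignores a couple of retried reads, while `should_flag_drive(2, 10, 11)` flags the drive because the reallocation count went up.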

I'm not saying whether higher-end drives are more reliable or not. Given that most standard SATA errors go undetected for longer, one could even argue that higher-end drives seem to fail much more frequently... I've had more FC drives replaced in a single EMC storage array than in the rest of the servers (which have a mix of internal 2.5in SAS and older 3.5in SCSI320 drives), and we certainly replace more drives in servers than in desktops.

But that's another topic entirely.