by drzaiusapelord 4598 days ago
Sysadmin here. My experience:

1. Infant mortality. Some drives fail within the first couple of months of use.

2. 3 year mark. This is where failures begin for typical workloads.

3. 4-6 year mark. This is when you can expect the drives that haven't failed earlier to fail. By this point, we're looking at 33% failed.

Interesting that my experiences roughly match up with Chart 1.

My experience is with 10k to 15k rpm SAS drives. Slower 7200rpm drives? No idea. Haven't used them in servers in a while. They seem more of a crapshoot to me. SSDs, thus far, are even more of a crapshoot; we don't use them in servers, and only hesitantly in desktops/laptops, and only Intel.


Agreed RE: SSD drives ...

It is very disappointing how flaky and unreliable SSD devices have been when their promise was just the opposite, due to lack of moving parts.

Back in 1999/2000 I had a habit of building some personal as well as commercial servers in datacenters with CompactFlash parts (plain old consumer CF drives) as boot devices, with fault tolerance in mind. There was a price to be paid in that these devices needed to be mounted, and run, read-only.

But they ran forever. I never had one part fail. Just plain old CF drives mated directly to the IDE interface.

Now fast forward to 2013 and new servers we deploy for rsync.net have a boot mirror made of two SSDs ... things have gone well, but our general experience and anecdotal evidence from other parties gives us pause.

One thought: an SSD mirror, if it fails from some weird device bug or strange "wear" pattern, would fail entirely, since both members of the mirror are getting the exact same treatment. For that reason, when we build SSD boot mirrors, we do so with two different parts - either one current gen and one previous gen Intel part, or one Intel part and one Samsung part. That way, if there is some strange behavior or defect or wearing issue, they won't both experience it.
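A rough sketch of how that rule could be checked at build time, nothing more. It assumes you have already collected the model string of each mirror member (on Linux these can usually be read from /sys/block/<dev>/device/model); the function name and the model strings in the example are made up for illustration.

```python
def mirror_is_diverse(models):
    """Return True if no two mirror members report the same model string.

    `models` is a list of model strings, one per mirror member.
    Comparison ignores case and surrounding whitespace.
    """
    normalized = [m.strip().lower() for m in models]
    # A set drops duplicates, so equal lengths means every model is distinct.
    return len(set(normalized)) == len(normalized)


# Hypothetical model strings, not real part numbers.
mixed = ["ExampleVendorA SSD gen2", "ExampleVendorB SSD"]
matched = ["ExampleVendorA SSD gen2", "examplevendora ssd gen2 "]
```

Here mirror_is_diverse(mixed) is True and mirror_is_diverse(matched) is False. Note this only compares model strings; two differently named parts could still share a controller or firmware, which this check would not catch.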

They get the same writes but not the same reads, so depending on the bug source it may not hit both at the same time. The read pattern itself may affect how the writes are performed to the flash (delaying or speeding up writes pending commit), which may have a butterfly effect on the rest of the drive's behavior and knock the disks out of sync with regard to firmware bugs.

If you still wanted to follow up on your idea of a read-only root, like you did with CF cards, and figured out a safe place for the logs, you could use the SSDs in the same mode. Why not go that route?

Yes, depending on the bug source. But the bug source might be related to reads. Nobody knows. Splitting the risk across two different vendors / implementations seems to be good insurance.

I mostly handle server appliances and the read-only boot disk is bread-and-butter for anything I do. Bonus points for using initramfs and never hitting the boot disk after the initial boot is completed.

But if you stick with boot SSDs that are both read and written, using different makes sounds like a good strategy.

It would of course be hard to avoid read-only flash no matter what you did - both the BIOS and the PXE ROM on the Ethernet card are presumably read-only flash today (that is, writeable, but in practice only used for reading).

I did exactly the same thing with CF. We had a default config that would operate read-only and leave the machine in a reachable state no matter what. Once we got that far, we'd mount some spinning rust or NFS, pivot_root, and run a secondary init.

It was a huge win for uptime.
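For anyone wanting to verify that a box really did come up read-only, here is a minimal sketch, assuming a Linux system where /proc/mounts lists one mount per line as "device mountpoint fstype options ...". The function name is just illustrative.

```python
def is_read_only(mounts_text, mount_point="/"):
    """Check whether mount_point is mounted read-only.

    mounts_text is the content of /proc/mounts (or text in the same format).
    Returns True or False for the last entry at mount_point, since later
    mounts cover earlier ones, or None if mount_point is not listed.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[1] == mount_point:
            # The fourth field is a comma-separated option list, e.g. "ro,relatime".
            result = "ro" in fields[3].split(",")
    return result
```

On a live machine you would pass it the content of /proc/mounts, e.g. is_read_only(open("/proc/mounts").read()). It only reports the mount flag; it says nothing about whether something else is still writing to the underlying device.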

I've used mostly 7200rpm SATA and Nearline-SAS and they are mostly fine, in fact. I haven't played enough with higher rpm SAS yet, but so far I don't see a big difference between them and the 7200rpm NL-SAS drives.

I'd echo the sentiment seen elsewhere in the comments about Seagate drives vs. Hitachi drives, both for SATA and NL-SAS. The Hitachi 1TB drives were rock solid compared to the Seagates.

Completely anecdotal... but the 640GB two-platter drives have seemed to be the most rock solid. YMMV though. With the much larger, and much more expensive (since the Thailand floods) drives, who knows anymore... it's relatively anecdotal for anyone at this point, since after 3-5 years all the warranties have expired.