Hacker News
by joenathan 3536 days ago
This is good info to know; it helps me as a sysadmin to be confident in making decisions for my customers and their data. I regularly use a tool called CrystalDiskInfo to check the SMART stats of drives. I will pay more attention to the raw values in the future.

It's interesting that most people rely on the raw values, since the standard does not require them to be meaningful and, depending on the vendor, they could be anything.

I suspect this is because the value, worst, and threshold columns are kind of confusing.
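
For what it's worth, the normalized columns follow a fairly simple rule as smartmontools documents it: an attribute counts as failed when its normalized value is at or below the threshold, and worst is the lowest normalized value recorded so far. A minimal Python sketch of that rule, where the function name and the sample numbers are made up for illustration:

```python
def attribute_status(value: int, worst: int, thresh: int) -> str:
    """Classify one SMART attribute from its normalized columns."""
    if value <= thresh:
        # The current normalized value has reached the vendor threshold.
        return "failing now"
    if worst <= thresh:
        # The value recovered, but it did cross the threshold at some point.
        return "failed in the past"
    return "ok"


# Illustrative numbers only, not taken from a real drive.
print(attribute_status(value=100, worst=100, thresh=10))
print(attribute_status(value=8, worst=8, thresh=10))
print(attribute_status(value=60, worst=9, thresh=10))
```

Note that higher normalized values are better, which is the opposite of most raw counters and probably part of why the columns feel confusing.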

There aren't too many vendors for spinning disks, and if you have a lot of disks it doesn't take long to see that the sector count metrics correspond to sectors. In my experience, bad sector count is a good predictor of future trouble: when we ran disks until they threw read errors (before we were running SMART monitoring), they all had lots of bad sectors. That said, there's a threshold: getting to 100 slowly is probably OK, a thousand is probably not.
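
As a rough sketch of that rule of thumb, the Python below pulls raw values out of text shaped like the attribute table that `smartctl -A` prints and applies the 100 / 1000 cutoffs mentioned above. The sample table, the function names, and the "watch closely" middle band are assumptions for illustration; the sketch also ignores how fast the count is growing and skips raw values that aren't plain integers.

```python
# Invented sample in the column layout of `smartctl -A`; not from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def raw_values(table: str) -> dict:
    """Map attribute name to raw value for rows whose raw value is a plain integer."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            result[fields[1]] = int(fields[9])
    return result


def sector_verdict(count: int, ok_limit: int = 100, bad_limit: int = 1000) -> str:
    """Apply the rough cutoffs from the comment above (hypothetical labels)."""
    if count >= bad_limit:
        return "replace"
    if count > ok_limit:
        return "watch closely"
    return "probably ok"


for name, count in raw_values(SAMPLE).items():
    print(name, count, sector_verdict(count))
```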

SSDs though, they just disappear from the bus when they fail; so I haven't been able to look at a dead one and see what looks like a useful predictor. I have seen some SSDs reallocating a big block, which kills performance while it's going on...

"SSDs though, they just disappear from the bus when they fail"

This isn't always true, and actually shouldn't ever be true - it's a particular failure mode you're seeing, and while it appears to be one common across a number of SSD controllers, it's still a pretty sorry fact that it happens.

All SSDs (at least all not-complete-rubbish ones) report some kind of flash/media wearout indicator via SMART, which isn't necessarily an imminent failure indicator (SSDs will generally continue to work long past the technical wearout point), but is a very strong indicator that you should replace it soon and should probably buy a better one next time.

SSDs do suffer from sector reallocations in the normal way, and the same kind of metric monitoring can be done. It's pretty vendor-specific as to what SMART attributes they report, but attributes like available reserved space, total flash writes, flash erase and flash write failure counts and so on are pretty common.

With thousands of SATA SSDs, I've seen one fail in a traditional fashion (some sectors weren't readable, otherwise mostly fine), and the rest of the maybe hundred that failed would just disappear from the bus. I don't monitor the wear-out indicators, but from occasional looking, we're never near a significant fraction of the wear capacity. I'm very happy not to have any more spinning disks in production, because the SSDs fail less often; it's just that the failures are more annoying, because it's hard to have an orderly shutdown when disks disappear.

Funny how ~18 years later I still have compact flash devices plugged into IDE ports that have never failed. In fact, across a broad spectrum of applications and installs, I have never seen a working CF device fail in the field.

SSDs on the other hand ...

I use SSDs for caching (ZFS read cache and mirrored SLOGs) and I use them for mirrored boot devices in modern, production systems that should have a fast OS device.

But if I want a system to run forever ... if I am optimizing for longevity ... I use compact flash, even in 2016.

(yes, of course I set them to be read-only and disable swap)