by mdtancsa 212 days ago
Dropping off the bus is really the best-case failure. It's more annoying when a disk's writes become slower than those of the other disks, which often causes confusing performance profiles for the overall array. Having good metrics for each disk (we use telegraf) will help flag it early. On my ZFS pools, monitoring disk I/O for each disk, plus smartmon metrics, helps tease that out. For SSDs, probably the worst case is a firmware bug that triggers on all disks at the same time, e.g. the infamous HP SSD failure at 32,768 hours of use. Yikes!
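To make the "flag it early" part concrete, here is a rough sketch of the kind of check per-disk metrics allow: compare each disk's average write latency against the median of the array and flag the outliers. The function name, the 3x threshold, the 1 ms floor, and the sample numbers are all made up for illustration; this is not how telegraf or ZFS does it, just one way you might post-process numbers you already collect.

```python
from statistics import median


def flag_slow_disks(write_latency_ms, ratio=3.0, floor_ms=1.0):
    """Return the disks whose write latency exceeds `ratio` times the array median.

    write_latency_ms: dict of disk name -> average write latency in ms
                      over some window (hypothetical input shape).
    ratio, floor_ms:  illustrative tuning knobs, not recommended values.
    """
    # With fewer than three disks there is no meaningful "rest of the array".
    if len(write_latency_ms) < 3:
        return []
    # The median is robust to a single slow disk; the floor avoids flagging
    # noise when every disk is answering in a fraction of a millisecond.
    baseline = max(median(write_latency_ms.values()), floor_ms)
    return sorted(
        disk for disk, ms in write_latency_ms.items() if ms > ratio * baseline
    )


# Invented sample: three healthy disks and one that has quietly slowed down.
sample = {"da0": 2.1, "da1": 1.9, "da2": 2.3, "da3": 14.8}
print(flag_slow_disks(sample))
```

Comparing against sibling disks rather than a fixed threshold matters here because a degraded disk can still report clean SMART data while dragging the whole array down.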

We had ones that went into that failure mode with something like 80% life left. Zero negative SMART metrics; they just slowed down.

My hunch is that they don't expose anything because that makes it harder to claim a refund under warranty.