by DaiPlusPlus
668 days ago
> NVMe drives fail at a fairly low rate

But they still fail. Backups are great and all, but for hardware failure nothing beats redundancy (and RAID1, RAID5, etc. also allow for faster reads, though I don't know how often NVMe SSDs saturate their PCIe links...). Granted, you don't need hardware RAID for that (and HostRAID is a joke, lol): we still want redundancy, but today you'd do it with ZFS or similar so you aren't locked in to some HW RAID vendor, or suffer the ironic consequences of having non-redundant HW RAID controllers.
|
When it drops off, file-system writes to the LVs are blocked and reads can also fail, but the system survives well enough to do a controlled power off/on, which recovers it.
In some cases the LVs pair a spinning disk with the NVMe, but because of how I've configured the LV the spinner is read-mostly and the NVMe is write-mostly (RAID member syncing is delayed and done in the background). There isn't much noticeable latency except for things like `git log -Sneedle` - and it's worth it for the resilience.
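For anyone wanting to try a similar setup: the exact commands aren't given above, so this is only a rough sketch of how the two knobs mentioned (a write-mostly flag plus delayed background syncing) might map onto `lvchange` for an existing LVM raid1 LV. The LV name, PV path and write-behind value are hypothetical placeholders. In LVM, flagging a PV write-mostly means reads avoid that device where possible, so which PV to flag depends on which side should serve reads. The code only builds the command lines; it does not run them.

```python
import shlex
from typing import List


def raid1_write_mostly_commands(
    lv: str, flagged_pv: str, write_behind: int
) -> List[List[str]]:
    """Build lvchange invocations for an existing LVM raid1 LV.

    lv           -- "vg/lv" name of the raid1 logical volume (placeholder)
    flagged_pv   -- PV to flag write-mostly; LVM then avoids reading from it
    write_behind -- max outstanding writes allowed to write-mostly devices
                    (0 lets the system choose)
    """
    if write_behind < 0:
        raise ValueError("write_behind must be non-negative")
    return [
        # ":y" sets the flag explicitly; ":n" would clear it again.
        ["lvchange", "--writemostly", f"{flagged_pv}:y", lv],
        # Allow this many writes to lag behind on the flagged device.
        ["lvchange", "--writebehind", str(write_behind), lv],
    ]


if __name__ == "__main__":
    # Hypothetical names; print the commands for review instead of running them.
    for cmd in raid1_write_mostly_commands("vg0/data", "/dev/sdX1", 256):
        print(shlex.join(cmd))
```

Printing rather than executing is deliberate: these commands change how a live RAID LV behaves, so it seems safer to review them (and check them against `lvmraid(7)`) before running anything as root.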
[0] The first time it happened, spiders had taken up residence around the M.2 header and CPU (nice and warm!) and left dust trails that allowed current leakage between contacts (yes, I did a microscopic examination because I could not identify any other cause); a simple blast with the air compressor resolved it. Later incidents turned out to be physical stress from extreme thermal expansion and contraction, as best I can tell - ambient air temperature can fluctuate from 14C to 40C and back over 18 hours. Re-seating the M.2 adapter fixes it for a few months before it starts again! All NVMe SMART self-tests pass; the failure is of the link, not the storage - the device is effectively removed from the PCIe port. Firmware was suspected at one stage, although it had been fine for a couple of years on the same version, but updates haven't changed anything. ASPM is disabled.