by dboreham
1294 days ago
I'm confused as to why they didn't just replace the bad SSDs with good ones. FWIW, this sounds to me like what happens when you use "retail" SSDs (drives marketed for use in user laptops) underneath a high-write-traffic application such as a relational database. Such drives often wear out, turn out to have pathological performance characteristics (they eventually do something akin to GC), or just have firmware bugs. Use enterprise-rated drives for an application like this.
So to be clear, we did try to "offline" a drive from the ZFS pool just to see if this was a viable path. The ZFS pool was set up a few years ago and has gone through a few iterations of disks. The mirrors were unbalanced: we had pairs of drives of one manufacturer/speed mirrored with pairs of drives from another manufacturer/speed. We know this configuration was wrong; again, we didn't intend for our little home lab to turn into a small production service.
I think after spending a few hours trying to "offline" the disk, and then repairing the already brittle ZFS configuration to get the database/media store back to a "really broken and slow but still technically working" state, we just decided to pull the plug and move to Hetzner. Offlining the disk caused even more cascading failures and took about 30 minutes just for the software. We could have technically shut down production to try without the database running on it, but at that point we decided to just get out of the basement.
If it had been as easy as popping a disk in/out of the R630 (like one would imagine), we certainly would have done that.
To be honest, I am still very interested in performing more analysis on ZFS on a 6.0.8 Linux kernel. I am not convinced ZFS didn't have more to do with our problems than we think. I will likely do a follow-up article on benchmarking the old disks with and without ZFS in the future.
zfs-2.1.4-1
zfs-kmod-2.1.6-1
6.0.8-arch1-1