

by kris-nova 1296 days ago

Hi, I made the decision not to replace the drives. I also wrote the article, and am the admin of Hachyderm.

So to be clear, we did try to "offline" a drive from the ZFS pool just to see if this was a viable path. The ZFS pool was set up a few years ago and has gone through a few iterations of disks. The mirrors were unbalanced: we had pairs of drives of one manufacturer/speed mirrored with pairs of drives from another manufacturer/speed. We know this configuration was wrong; again, we didn't intend for our little home lab to turn into a small production service.

I think after spending a few hours trying to "offline" the disk, and then repairing the already brittle ZFS configuration to get the database/media store back to a "really broken and slow but still technically working" state, we just decided to pull the plug and move to Hetzner. Offlining the disk caused even more cascading failures and took about 30 minutes just for the software. Technically, we could have shut down production to try again without the database running on it, but at that point we decided to just get out of the basement.

If it had been as easy as popping a disk in/out of the R630 (like one would imagine), we would certainly have done that.

To be honest, I am still very interested in performing more analysis on ZFS on a 6.0.8 Linux kernel. I am not convinced ZFS didn't have more to do with our problems than we think. I will likely do a follow-up article on benchmarking the old disks with and without ZFS in the future.

zfs-2.1.4-1 zfs-kmod-2.1.6-1 6.0.8-arch1-1

2 comments

rbanffy 1295 days ago

> We had pairs of drives of one manufacturer/speed mirrored with pairs of drives from another manufacturer/speed.

The different speed is an issue, but I always recommend mixing pairs so that you don’t end up like me, when all spinning metal of the same RAID-5 array failed in a short period. Wasn’t a great day.

Lucky me I had a contingency plan.


ilyt 1296 days ago

Throw ZFS away, put in X drives, make RAID10+LVM with X-1 drives (Linux supports odd numbers of drives in RAID10), and never think about it again. It's simple to set up, simple to debug, and you don't need a ZFS expert for something as simple as disk replacement. In cases like what happened here, there is the --write-mostly option, which tells Linux RAID to prefer the other disks for reads, so you can see whether unloading the drive changes anything. Maybe RAID6 if you're not screaming for performance but want some more space.
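To put rough numbers on the RAID10 versus RAID6 space trade-off, here is a minimal sketch. It assumes equal-size drives, the default two-copy RAID10 layout, and no metadata overhead. The drive count and size are hypothetical, and treating the drive left out of the array as a spare is an assumption the comment doesn't state.

```python
def raid10_usable_tb(drives: int, size_tb: float, copies: int = 2) -> float:
    """Approximate usable space of a Linux md RAID10 array.

    Every block is stored `copies` times, so usable space is the raw
    total divided by the copy count. This also holds for odd drive counts.
    """
    if drives < copies:
        raise ValueError("need at least as many drives as copies")
    return drives * size_tb / copies


def raid6_usable_tb(drives: int, size_tb: float) -> float:
    """Approximate usable space of a RAID6 array (two drives' worth of parity)."""
    if drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (drives - 2) * size_tb


# Hypothetical box: X = 8 drives of 4 TB each, X-1 = 7 in the array.
total_drives = 8
array_drives = total_drives - 1
drive_size_tb = 4.0

raid10_tb = raid10_usable_tb(array_drives, drive_size_tb)
raid6_tb = raid6_usable_tb(array_drives, drive_size_tb)
print(f"RAID10 on {array_drives} drives: {raid10_tb} TB usable")
print(f"RAID6 on {array_drives} drives: {raid6_tb} TB usable")
```

With these made-up numbers, RAID6 yields 20 TB against 14 TB for RAID10, which is the "some more space" being traded for performance.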

Focus your efforts on making robust backups instead. You don't want to be the only guy in the org who knows how to do ZFS things when it breaks.

We're running a few racks of servers. ZFS is relegated to big boxes of spinning rust, where its benefits (deduplication/compression) are used well, but on a bunch of SSDs it is just overkill.
