by amluto 699 days ago
Here’s my pet peeve regarding RAID: no RAID system I’ve ever used gracefully handles disks that come and go. Concretely: start with two disks in RAID1. Remove one. Mount in degraded mode. Write to a file. Unmount. Reconnect the removed disk. Mount again with both disks.

The results vary between annoying (need to restore / “resilver” and have no redundancy until it’s done; massively increased risk of data loss while doing so due to heavy IO load without redundancy and pointless loss of the redundancy that already exists) to catastrophic (outright corruption). The corollary is that RAID invariably works poorly with disks connected over an interface that enumerates slowly or unreliably.

Yet most competent active-active database systems have no problems with this scenario!

I would love to see a RAID system that thinks of disks as nodes, properly elects leaders, and can efficiently fast-forward a disk that’s behind. A pile of USB-connected drives would work perfectly, would come up when a quorum was reached, and would behave correctly when only a varying subset of disks is available. Bonus points for also being able to run an array that spans multiple computers efficiently, but that would just be icing on the cake.
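To make the "disks as nodes" idea concrete, here is a minimal sketch in Python. It is not how any existing RAID implementation works; the `Disk` and `Array` classes, the in-memory `log`, and the generation counter are all hypothetical names invented for illustration. Writes need a majority of disks, and a disk that was away is fast-forwarded by replaying only the log entries it missed.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    name: str
    generation: int = 0                       # number of log entries applied so far
    blocks: dict = field(default_factory=dict)


class Array:
    """Toy quorum-based mirror. The log lives in memory purely for illustration;
    a real design would have to persist it on the member disks."""

    def __init__(self, disks):
        self.disks = list(disks)
        self.log = []                         # append-only: (generation, block_no, data)

    def has_quorum(self, present):
        # Strict majority of all member disks must be attached.
        return len(present) > len(self.disks) // 2

    def catch_up(self, disk):
        """Replay only the entries this disk has not seen; return how many."""
        replayed = 0
        for gen, block_no, data in self.log[disk.generation:]:
            disk.blocks[block_no] = data
            disk.generation = gen
            replayed += 1
        return replayed

    def write(self, present, block_no, data):
        if not self.has_quorum(present):
            raise RuntimeError("no quorum: refusing to write")
        self.log.append((len(self.log) + 1, block_no, data))
        for disk in present:
            self.catch_up(disk)               # also applies the entry just appended


a, b, c = Disk("a"), Disk("b"), Disk("c")
array = Array([a, b, c])
array.write([a, b, c], 0, b"first")
array.write([a, b], 1, b"second")             # disk c is unplugged for this write
missed = array.catch_up(c)                    # c returns and replays one entry
```

Note that with only two disks a strict majority means both must be present, so the degraded-write scenario above would simply be refused; that is the usual price of avoiding split-brain.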

5 comments

> The results vary between annoying (need to restore / “resilver” and have no redundancy until it’s done; massively increased risk of data loss while doing so due to heavy IO load without redundancy and pointless loss of the redundancy that already exists) to catastrophic (outright corruption).

I'm not sure what you expect?

RAID1 is a simple data copy, and you made sure both disks contain different data. So there are two possible outcomes: either the system notices this and copies A to B or B to A to reestablish the redundancy, or it fails to notice and you get corruption.

Linux MD allows for partial sync with the bitmap. If the system knows something in the first 5% of the disk changed, it can limit itself to only syncing that 5%.
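A rough sketch of the partial-sync idea, in Python. This is a simplified model and not the actual MD on-disk bitmap format or algorithm; `BitmapMirror`, `chunk`, and the `dirty` set are hypothetical names. While the second disk is missing, every write marks the chunks it touches, and the later resync copies only those chunks.

```python
class BitmapMirror:
    """Toy two-way mirror that tracks dirty chunks while one side is absent."""

    def __init__(self, size, chunk):
        self.chunk = chunk
        self.a = bytearray(size)              # disk that stays attached
        self.b = bytearray(size)              # disk that comes and goes
        self.b_present = True
        self.dirty = set()                    # chunk indices b has missed

    def write(self, offset, data):
        end = offset + len(data)
        self.a[offset:end] = data
        if self.b_present:
            self.b[offset:end] = data
        else:
            # Remember which chunks changed instead of the data itself.
            self.dirty.update(range(offset // self.chunk,
                                    (end - 1) // self.chunk + 1))

    def resync(self):
        """Copy only dirty chunks from a to b; return the bytes copied."""
        copied = 0
        for idx in sorted(self.dirty):
            start = idx * self.chunk
            end = min(start + self.chunk, len(self.a))
            self.b[start:end] = self.a[start:end]
            copied += end - start
        self.dirty.clear()
        self.b_present = True
        return copied


mirror = BitmapMirror(size=1000, chunk=100)
mirror.b_present = False                      # second disk removed
mirror.write(250, b"x" * 10)                  # touches only chunk 2
copied = mirror.resync()                      # second disk returns
```

Here the resync moves one 100-byte chunk instead of all 1000 bytes, which is the same trade the bitmap makes on a real array: coarser chunks mean a smaller bitmap but more data copied per dirty bit.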

> Yet most competent active-active database systems have no problems with this scenario!

Because they're not RAID. The whole point of RAID is that it's extremely simple. This means it's a brute force method with some downsides, but in exchange it's extremely easy to reason about.

I mean “RAID” in the more general sense, including btrfs, ZFS, etc, not just old-school RAID.
USB connected disks introduce new problems, like random disconnections.

RAID is overkill for home use. It also does not solve backups and snapshots. I use one way syncthing with unlimited history, plus usb-sata adapter.

Beware, ZFS often hangs on USB disconnections, forcing a reboot:

https://github.com/openzfs/zfs/issues/3461

Yes, for home use I prefer several computers with a single disk each, every one holding a copy of the data, over one computer with RAID.
That means you have no bitrot protection, in fact you’ve now increased that possibility.
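For anyone wondering why plain copies don't cover bitrot: if two copies differ and nothing recorded what the data should be, you can't tell which one rotted. A stored checksum settles it. A minimal Python sketch of that idea follows; it is generic, not the checksum scheme of any particular filesystem, and `pick_good_copy` is a made-up helper name.

```python
import hashlib


def pick_good_copy(copies, expected_sha256):
    """Return the first copy whose SHA-256 matches the recorded digest, else None."""
    for data in copies:
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    return None


good = b"family photos"
recorded = hashlib.sha256(good).hexdigest()   # digest saved when the data was written
rotten = b"family phot0s"                     # one copy silently changed on disk

recovered = pick_good_copy([rotten, good], recorded)
unrecoverable = pick_good_copy([rotten], recorded)
```

With a second intact copy the bad one can be identified and repaired; with only the rotten copy the checksum still detects the damage but cannot fix it.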
What's your syncthing setup?
One-way sync to a couple of servers. Unlimited history in Syncthing for backups.
A ZFS resilver is fast if not much data has changed; it only takes a few minutes.
I didn’t know that — thanks!
Does ceph not fulfill your requirements here? Especially that last "spans multiple computers" bit.
Ceph doesn’t really nail the “I want to boot off this thing” use case. It would be interesting to try, though.
Ceph provides an S3-compatible object store, no? If so, just use s3backer[1] with a loopback mount and boot[2] off it?

I mean, sounds like a house of cards but, should be possible?

[1]: https://github.com/archiecobbs/s3backer

[2]: https://ersei.net/en/blog/fuse-root

I'd actually recommend going a bit further instead, since it'll be more reliable and easier to set up on the client side. Use the iSCSI gateway and an RBD image. This'll get you the availability of Ceph and is much better supported than using FUSE or S3 for booting. You can even install Windows on an iSCSI target and PXE boot it (disclaimer: I've only read about this being done, not actually done it) so that you don't need any local storage at all on the remote machine.

You'll still want a fast network (I'd recommend 10GbE on the server at least and 2.5GbE on clients; 1GbE will work, but you will notice it bottleneck in bursts), but that won't be any different from any other network booting/rooting process.
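Some back-of-the-envelope arithmetic for those link speeds, as a small Python sketch. It uses raw line rate only and ignores protocol overhead, latency, and what the disks behind Ceph can actually deliver, so real numbers will be worse; `transfer_seconds` is just an illustrative helper.

```python
def transfer_seconds(size_gib, link_gbps):
    """Ideal time to move size_gib GiB over a link of link_gbps gigabits per second."""
    bytes_total = size_gib * 2**30
    bytes_per_second = link_gbps * 1e9 / 8    # 1 Gbit/s is 125 MB/s at line rate
    return bytes_total / bytes_per_second


for gbps in (1, 2.5, 10):
    print(f"{gbps:>4} GbE: {transfer_seconds(1, gbps):.2f} s per GiB")
```

At line rate that works out to roughly 8.6 s per GiB on 1GbE, 3.4 s on 2.5GbE, and 0.9 s on 10GbE, which is why bursts of reads during boot are where the slower links show.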

https://docs.ceph.com/en/latest/rbd/iscsi-overview/

Oh wow how ironic. I totally forgot about iSCSI, despite having used it against Ceph for testing in my home lab. Yeah, definitely go for that.

Or the S3-fuse route if you just want to geek flex.

If you're going that way, you'll be much, much happier with RBD.

https://docs.ceph.com/en/reef/rbd/

By “I want to boot off this thing” I mean that, if I have a computer with three disks, I want to run a normal-ish distro on that computer with those three disks possibly as /home or /var or maybe even as /.

I’m sure it’s possible. I don’t think this is quite what Ceph is intended for :)

ZFS will come closest.

I have a ZFS mirror, where I have taken one disk out, added files to it elsewhere, returned it and reimported.

The pool immediately resilvered the new content onto the untouched drive.

Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.

> Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.

I believe btrfs replace will copy only the data that had a replica on the failing drive.

You wouldn't replace in this context.

You would mount degraded on the remote system and copy in the new files.

After returning the drive, you would mount normally, but getting the new content mirrored requires a rebalance.

Replace is for a blank drive, and it hasn't worked very well for me, as status/usage reported some data that was not mirrored; a rebalance fixed this.