Hacker News
by Arnavion 1190 days ago
I use a WebDAV server for storing backups (Fastmail Files). The server allows 10GB usage, but max file size is 250MB, and in any case WebDAV does not support partial writes. So writing a file requires reuploading it, which is the same situation as S3.

What I did is:

1. Create 10000 files, each of 1MB size, so that the total usage is 10GB.

2. Mount each file as a loopback block device using `losetup`.

3. Create a RAID device over the 10000 loopback devices with `mdadm --build --level=linear`. This RAID device appears as a single block device of 10GB size. `--level=linear` means the RAID device is just a concatenation of the underlying devices. `--build` means that mdadm does not store metadata blocks in the devices, unlike `--create` which does. Not only would metadata blocks use up a significant portion of the 1MB device size, but also I don't really need mdadm to "discover" this device automatically, and also the metadata superblock does not support 10000 devices anyway (the max is 2000 IIRC).

4. From here the 10GB block device can be used like any other block device. In my case I created a LUKS device on top of it, then an XFS filesystem on top of the LUKS device, and that XFS filesystem is my backup directory.

So any modification of files in the XFS layer eventually results in some of the 1MB blocks at the lowest layer being modified, and only those modified 1MB blocks need to be synced to the WebDAV server.

(Note: SI units. 1KB == 1000B, 1MB == 1000KB, 1GB == 1000MB.)
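For readers who want to see the four steps as commands, here is a minimal dry-run sketch. It only prints the commands rather than running them, since the real ones need root. `BLOCK_DIR`, `MD_DEV`, `LUKS_NAME`, `MOUNT_POINT` and the `.img` file naming are hypothetical; the loop device numbering starting at 10000 follows the `/dev/loop1{0000..9999}` pattern mentioned in the follow-up comment. It is not the author's actual script.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for steps 1-4 instead of executing them.
# All names below are hypothetical placeholders, not taken from the post.
BLOCK_DIR=${BLOCK_DIR:-/srv/backup-blocks}
MD_DEV=${MD_DEV:-/dev/md0}
LUKS_NAME=${LUKS_NAME:-backup}
MOUNT_POINT=${MOUNT_POINT:-/mnt/backup}
BLOCK_BYTES=1000000   # 1MB in SI units, as in the note above

print_plan() (
  n=$1      # number of block files (the post uses 10000)
  i=0
  loops=
  while [ "$i" -lt "$n" ]; do
    file=$(printf '%s/%05d.img' "$BLOCK_DIR" "$i")
    loop=/dev/loop$((10000 + i))
    # Step 1: create the block file; step 2: attach it to a loop device.
    echo "truncate -s $BLOCK_BYTES $file"
    echo "losetup $loop $file"
    loops="$loops $loop"
    i=$((i + 1))
  done
  # Step 3: concatenate the loop devices without writing metadata to them.
  echo "mdadm --build $MD_DEV --level=linear --raid-devices=$n$loops"
  # Step 4: LUKS on the array, XFS inside LUKS, mounted as the backup directory.
  echo "cryptsetup luksFormat $MD_DEV"
  echo "cryptsetup open $MD_DEV $LUKS_NAME"
  echo "mkfs.xfs /dev/mapper/$LUKS_NAME"
  echo "mount /dev/mapper/$LUKS_NAME $MOUNT_POINT"
)

print_plan 3
```

Calling `print_plan 10000` prints the full-size plan; review the output before running any of it as root.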


Of course, despite working on this for a week I only now discovered this... dm_linear is an easier way than mdadm to concatenate the loopback devices into a single device. Setting up the table input to `dmsetup create`'s stdin is more complicated than just `mdadm --build ... /dev/loop1{0000..9999}`, but it's all scripted anyway so it doesn't matter. And `mdadm --stop` blocks for multiple minutes for some unexplained reason, whereas `dmsetup remove` is almost instantaneous.
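As a rough illustration of the table that `dmsetup create` expects on stdin, here is a small sketch that generates one dm-linear line per loop device. The function name `dm_linear_table` and the device-mapper name in the usage note are hypothetical. The per-device sector count is a parameter, because it has to match the layout the data was written with.

```shell
#!/bin/sh
# Print a dm-linear table that concatenates COUNT loop devices.
# Each line has the form:
#   <start sector> <length in sectors> linear <backing device> <offset>
dm_linear_table() (
  count=$1            # number of loop devices
  sectors=$2          # 512-byte sectors taken from each device
  first=${3:-10000}   # number of the first loop device
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "$((i * sectors)) $sectors linear /dev/loop$((first + i)) 0"
    i=$((i + 1))
  done
)

dm_linear_table 3 1953
# 0 1953 linear /dev/loop10000 0
# 1953 1953 linear /dev/loop10001 0
# 3906 1953 linear /dev/loop10002 0
```

The full table could then be piped in as root, e.g. `dm_linear_table 10000 1953 | dmsetup create backup-blocks`, and torn down with `dmsetup remove backup-blocks`.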

One caveat is that my 1MB (actually 999936B) block devices have 1953 sectors (999936B / 512B) but mdadm had silently only used 1920 sectors from each. In my first attempt at replacing mdadm with dm_linear I used 1953 as the number of sectors, which led to garbage when decrypted with dm_crypt. I discovered mdadm's behavior by inspecting the first two loopback devices and the RAID device in xxd. Using 1920 as the number of sectors fixed that, though I'll probably just nuke the LUKS partition and rebuild it on top of dm_linear with 1953 sectors each.
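The sector numbers above can be checked with a bit of shell arithmetic. One observation, offered as an assumption rather than a confirmed explanation of mdadm's behavior: 1920 is exactly 1953 rounded down to a multiple of 128 sectors (64KiB). `ROUND_SECTORS` below encodes that guess.

```shell
#!/bin/sh
FILE_BYTES=999936     # actual size of each block file
SECTOR_BYTES=512
ROUND_SECTORS=128     # 64KiB / 512B; assumed rounding unit, not confirmed

file_sectors=$((FILE_BYTES / SECTOR_BYTES))
rounded=$((file_sectors / ROUND_SECTORS * ROUND_SECTORS))
unused=$(( (file_sectors - rounded) * SECTOR_BYTES ))

echo "sectors per block file:        $file_sectors"      # 1953
echo "rounded down to 64KiB units:   $rounded"           # 1920
echo "unused bytes per block file:   $unused"            # 16896
echo "unused bytes over 10000 files: $((unused * 10000))"

# The second device starts at a different sector in each layout, so data
# written with one layout reads as garbage when mapped with the other.
echo "device 2 starts at sector $rounded (observed mdadm layout)"
echo "device 2 starts at sector $file_sectors (full-size dm_linear layout)"
```

With the full 1953 sectors per device, the concatenated device gains about 169MB compared to the 1920-sector layout.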

What a coincidence, I just recently did something similar.

Did you run into any problems with discard/zeroing/trim support?

This was a problem with sshfs — I can’t change the version/settings on the other side, and files seemed to simply grow and become more fragmented.

I suspected WebDAV or Samba might have been the solution but never looked into it since sshfs is so solid.

Upon reading this idea I created https://github.com/lrvl/PosixSyncFS - feel free to comment
I did create the block files as sparse originally (using `truncate`), but at some point in the process they became realized on disk. Don't know if it was the losetup or the mdadm or the cryptsetup. I didn't really worry about it, since the block files need to be synced to the WebDAV server in full anyway.
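A quick way to see whether a block file is still sparse is to compare its apparent size with the space actually allocated for it. This is a minimal sketch assuming GNU `stat` and a scratch directory; the file name is hypothetical, and the allocated numbers depend on the filesystem.

```shell
#!/bin/sh
tmp=$(mktemp -d)
f="$tmp/block.img"

# Allocated bytes = number of allocated blocks * size of each such block.
allocated_bytes() {
  echo $(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
}

# Create a sparse block file: the size is set but no data is written.
truncate -s 999936 "$f"
apparent=$(stat -c %s "$f")
sparse_alloc=$(allocated_bytes "$f")
echo "after truncate: apparent=$apparent allocated=$sparse_alloc"

# Overwrite every sector in place, as a layer above might end up doing.
dd if=/dev/zero of="$f" bs=512 count=1953 conv=notrunc 2>/dev/null
echo "after full write: apparent=$(stat -c %s "$f") allocated=$(allocated_bytes "$f")"

rm -rf "$tmp"
```

Running the same `stat` comparison on the real block files after each setup step would show which layer fills them in.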
Ahh OK, I think I see -- since the block files are synced in full, you are always swapping blocks and doing ~1MB of writing no matter what.

> I use a WebDAV server for storing backups (Fastmail Files). The server allows 10GB usage, but max file size is 250MB, *and in any case WebDAV does not support partial writes*. So writing a file requires reuploading it, which is the same situation as S3.

This is the part I absolutely missed. I was wondering how you were ensuring 1MB writes -- whether it was at the XFS level or mdraid level...

I think another thing that is missing which I'm inferring (hopefully correctly) is that you've mounted your webdav server to disk. So your stack is:

- LUKS

- mdraid

- losetup

- webdav fs mount

Is that correct?

The stack is XFS inside cryptsetup inside mdraid on top of losetup. The directory containing the losetup block files could be `rclone mount`'d from the WebDAV server, but that would make the setup unavailable if I didn't have network access. So instead I chose to have the block files in a regular directory, and I make sure to `rclone sync` that directory to the WebDAV server when I make changes in the XFS layer. Manually syncing also lets me run `sync` and watch the `rclone sync` output, which gives me greater confidence that all the layers have synced successfully.
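The manual sync described above could be wrapped in a small script along these lines. `BLOCK_DIR` and `REMOTE` are hypothetical placeholders for the local block directory and an rclone remote configured for the WebDAV server. By default it only prints what it would run.

```shell
#!/bin/sh
# Hypothetical locations; adjust to the real block directory and rclone remote.
BLOCK_DIR=${BLOCK_DIR:-/srv/backup-blocks}
REMOTE=${REMOTE:-fastmail:backup-blocks}
DRY_RUN=${DRY_RUN:-1}   # set to 0 to actually execute the commands

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

push_blocks() {
  # Flush pending writes from the XFS/LUKS/mdraid layers into the block files.
  run sync
  # Upload the block files that changed, showing progress while it runs.
  run rclone sync --progress "$BLOCK_DIR" "$REMOTE"
}

push_blocks
```

Keeping the upload as an explicit step matches the approach in the comment: the setup stays usable offline, and the `rclone sync` output shows exactly what was pushed.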

> Ahh OK, I think I see -- since the block files are synced in full, you are always swapping blocks and doing ~1MB of writing no matter what.

Right. Let's say I update two files in the XFS layer. Those writes eventually result in three blocks in the lowest layer being modified. So now the `rclone sync` will need to do a `PUT` request to replace those three blocks on the WebDAV server, which means it'll upload 3MB of data to the server.
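To make the "three blocks, 3MB" arithmetic concrete, here is a toy simulation that checksums a handful of stand-in block files before and after a few in-place writes and counts how many changed. The files are 1000 bytes instead of 1MB and all names are made up. rclone does its own change detection; this only illustrates the idea.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/blocks"

# Five tiny stand-ins for the 1MB block files.
i=0
while [ "$i" -lt 5 ]; do
  head -c 1000 /dev/zero > "$tmp/blocks/$i.img"
  i=$((i + 1))
done

# One checksum line per block file, in a stable order.
snapshot() ( cd "$1" && sha1sum -- * )

snapshot "$tmp/blocks" > "$tmp/before"

# Simulate filesystem writes that land in three of the blocks.
for b in 1 2 4; do
  printf 'x' | dd of="$tmp/blocks/$b.img" bs=1 seek=10 conv=notrunc 2>/dev/null
done

snapshot "$tmp/blocks" > "$tmp/after"

# Every changed checksum is one whole block file to re-upload.
changed=$(diff "$tmp/before" "$tmp/after" | grep -c '^>')
echo "blocks to upload: $changed"
echo "upload size at 1MB per block: ${changed}MB"

rm -rf "$tmp"
```

Even though only one byte changed in each of the three files, each one has to be re-uploaded in full, which is why the block size sets the minimum cost of a sync.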

Thanks for the explanation, this makes perfect sense now, didn't realize the syncing was manual/separate.

If they're using LUKS then I think trimming/discard won't be possible.

My immediate instinct was that LUKS could issue trim/discard.

It looks like there's some anecdotal evidence out there that LUKS can discard

https://superuser.com/questions/124310/does-luks-encryption-...

https://unix.stackexchange.com/questions/341442/luks-discard...

My question is more about the mdraid at the bottom of the stack than anything. I'm also a little curious about the performance of something like WebDAV vs. Samba vs. sshfs (sshfs usually wins out, and WebDAV does not strike me as particularly efficient).

Wouldn't the blocks all be cached locally for the most part? WebDAV is being used as a write-behind log/backup. It should be as fast as local access through a file system created over mdraid loopback block devices ...

You're right (see the sibling comment chain). I didn't realize this was just being done on local disk with periodic backup; I thought WebDAV was below it all!

FWIW this is similar to Apple's "sparse image bundle" feature, where you can create a disk image that internally is stored in 1MB chunks (the chunk size is probably only customizable via the command line `hdiutil`, not the UI). You can encrypt it and put a filesystem on top of it.

Are you using davfs2 to mount the 1MB files from the WebDAV server?

I started out with davfs2, but a) it was very slow at uploading for some reason, b) there was no way to explicitly sync it, so I had to either wait a minute for some internal timer to trigger the sync or unmount it, and c) it implements writes by writing to a cache directory in /var/cache, which was just a redundant 10GB copy of the data I already have.

I use `rclone`. Currently rclone doesn't support the SHA1 checksums that Fastmail Files implements. I have a PR for that: https://github.com/rclone/rclone/pull/6839

Thanks for the response.

So you are using `rclone sync` to periodically push local changes up to the WebDAV server?

This is a very nice solution.