Hacker News new | ask | show | jobs
by ChuckMcM 3695 days ago
So to understand why this is you have to appreciate the goals behind write-anywhere-file-layout (aka WAFL) file systems. [1]

One of the goals of such systems is that copy of the file system on disk is always consistent, turn power off at any point and you can come right back up with a valid file system. This is accomplished by only writing to the 'free block list'. You construct updated inodes from the file change all the way up to the root inode out of new blocks and then to "step" forward you write a new root block. This is really neat and it means that when you've done that step, you still have the old inodes and datablocks around, they just aren't linked but you can link them to another "holder" inode attached to the name ".snapshot" and it will show you the file system just before the change. Write the old root block back into the real root block and "poof!" you have reverted the file system back to the previous snapshot.

Ok, so that is pretty sweet and really awesome in a lot of ways, but it has a couple of problems. The first, as noted, is that it pretty much guarantees fragmentation as its always reaching for free blocks and they can be anywhere. On NetApp boxes of old, that wasn't too much of a big deal because everything was done "per RAID stripe" so you were fragmented, but you were also reading/writing full stripes in RAID so you had the bandwidth you needed and fragmentation was absorbed by the efficiencies of full stripe reads/writes. But the second issue arises when you start getting close to full, managing the free block list gets harder and harder. You are constantly getting low block pressure, so you are constantly trying to reclaim old blocks (on unused snapshots, or expired ones) and that leads to a big drop in performance. The math is you can't change more of the data between snapshot steps than the amount of space you have free. That is why NetApp filers would get cranky using them in build environments where automated builds would delete scads of intermediate files, only to rebuild them and then relink them. Big percentage change in the overall storage.

On the positive side, storage is pretty darn cheap these days, so a swapping in 3TB drives instead of 2TB drives means you could use all the storage you "planned" to use and keep the drives at 66% occupancy. Hard on users though who will yell at you "It says it has 10TB of storage and is only using 6TB but you won't expand my quota?" At such times it would be useful for the tools to lie but that doesn't happen.

[1] Disclosure 1, I worked for 5 years at NetApp with systems that worked this way. Disclosure 2, an intern with NetApp (we'll call him Matt) was very impressed with this and went on to work at Sun for Jeff and similar solutions appeared in ZFS.

2 comments

"One of the goals of such systems is that copy of the file system on disk is always consistent."

Goal yes, implementation no. WAFL does in fact have consistency problems and filers do ship with a consistency checker called "wack" which if you ever need this tool you'll probably have better luck throwing the filer in the trash and restoring from backups rather than waiting a month for it to complete.

Why not 'defrag' the free list during low io so this issue is somewhat mitigated?
At least on ZFS, the whole reason "defrag" is impractical is that a bunch of places in the FS structure assume the logical address of a block is immutable for the lifetime of the block, which makes a number of properties really easy and inexpensive, but also means that your life is suffering if you want to try to modify that particular constraint.

If you'd like to see some information on a feature that's been added while working around that particular constraint (or, rather, mitigating the impact of it), check out [1].

[1] - http://open-zfs.org/w/images/b/b4/Device_Removal-Alex_Reece_...

Defragmenting a merkle tree required BPR, which temporarily breaks the structure intended to keep data safe. The only code known to have achieved it performed poorly and is behind closed doors at Oracle.

The benefits in terms of defragmentation are also limited because ZFS does a fair job of resisting fragmentation related performance penalties. The most that I would expect to see on a pool where poor performance is not caused by the best fit allocator would be a factor of two on sequential reads.

As it says in that slide deck's first slide (after the title slide), second bullet, this particular device removal technique is to deal with an "oops" where one accidentally adds a storage vdev to an existing pool.

The zpool command line utility tries hard to help you not shoot yourself in the foot, but "zpool add -f pool diskname" sometimes happens when "zpool add -f pool cache diskname" was meant. Everyone's done it once. Thinks of a system melting down because the l2arc has died, and you're trying to replace it in a hurry, and you fat-finger the attempt to get rid of the "-n" and end up getting rid of "log" instead.

Without this device removal, that essentially dooms your pool -- there is no way to back out, and the best you can do is throw hardware at the pool (attach another device fast to mirror the single device vdev, then try to grow the vdev to something temporarily useful, where "temporarily" almost always means "as long as it takes to get everything properly backed up" with the goal being the destruction and re-creation of the pool (plus restoral from backups).

With this device removal, you do not have to destroy your pool; you have simply leaked a small amount of space (possibly permanently) and will carry a seek penalty on some blocks (possibly permanently, but that's rarer) that get written to that vdev before the replacement.

As noted further in the slide deck (and in Alex's blog entries), this only works for single device vdevs -- you cannot remove anything else, like a raidz vdev, and you have to detach devices from mirror vdevs before removal.

Also, note the overheads: although you can remove a single-device vdev with a large amount of data on it, doing so is a wrecking ball to resources, particularly memory. You won't want to do something like:

Before:

mirror-0 disk0 2tb-used 3tb-disk-size disk1 2tb-used 3tb-disk-size mirror-1 disk2 2tb-used 3tb-disk-size disk3 2tb-used 3tb-disk-size

do an expand dance, so you have

mirror-0 disk0 2tb-used 6tb-disk-size disk1 2tb-used 6tb-disk-size mirror-1 disk2 2tb-used 3tb-disk-size disk3 2tb-used 3tb-disk-size

then detach disk3, then device-removal remove disk2, except in extremely special circumstances, and where you are well aware of the time it will take, the danger to the unsafe data in the pool during the removal (i.e., everything in former mirror-1), that your pool will be trashed beyond hope in the presence of crashes or errors during the removal, and that you will have a permanent expensive overhead in the pool after the removal is done.

It would almost certainly be much faster and vastly safer to make a new pool with the 6tb disks and zfs send data from the old one to the new one.

I think we're basically agreeing loudly over everything except the example being a demonstration of mitigating the impact of BPs being immutable while adding a feature that requires that statement be less than true - and I agree, the permanent overhead of a mini-DDT is a non-starter for anything other than the example case of "oops I added a device, time to evac it before $TONS_OF_DATA gets landed".

Certainly, it would be much less exciting to send|recv from poolA to poolB, and require no code changes and no GB per TB of data indirection overhead.

But this was intended as an example of how many caveats and problems are involved in even a "simple" feature involving shuffling data on-disk, and thus, why "defrag" is a horrendously hard problem in this environment.