by rincebrain 3695 days ago
At least on ZFS, the whole reason "defrag" is impractical is that a bunch of places in the FS structure assume the logical address of a block is immutable for the lifetime of the block, which makes a number of properties really easy and inexpensive, but also means that your life is suffering if you want to try to modify that particular constraint.

If you'd like to see some information on a feature that's been added while working around that particular constraint (or, rather, mitigating the impact of it), check out [1].

[1] - http://open-zfs.org/w/images/b/b4/Device_Removal-Alex_Reece_...


Defragmenting a merkle tree requires BPR (block pointer rewrite), which temporarily breaks the structure intended to keep data safe. The only code known to have achieved it performed poorly and is behind closed doors at Oracle.
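To make the merkle tree point concrete, here is a minimal toy sketch in Python. It is not ZFS code: `Disk`, `make_pointer`, `make_parent` and `relocate` are hypothetical names, and a "block pointer" here is just an (address, checksum) pair. It only illustrates that a parent's checksum covers its children's addresses, so moving a block with byte-identical contents still forces a rewrite of every ancestor.

```python
import hashlib


class Disk:
    """Toy disk: maps an address to the bytes stored there."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def write(self, data: bytes) -> int:
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_pointer(disk, data):
    """A toy block pointer: where the block lives and what it hashes to."""
    return (disk.write(data), checksum(data))


def make_parent(disk, child_pointers):
    """A parent block's contents are its children's pointers, so each
    child's address is covered by the parent's own checksum."""
    return make_pointer(disk, repr(child_pointers).encode())


def relocate(disk, pointer):
    """Move a block to a new address without changing its contents."""
    addr, csum = pointer
    data = disk.blocks.pop(addr)
    return (disk.write(data), csum)


disk = Disk()
leaf = make_pointer(disk, b"file data")
parent_before = make_parent(disk, [leaf])

moved_leaf = relocate(disk, leaf)
parent_after = make_parent(disk, [moved_leaf])

# The leaf's data (and checksum) did not change, only its address did...
same_leaf_checksum = moved_leaf[1] == leaf[1]
# ...yet the parent's checksum is different, and so on up to the root.
parent_changed = parent_after[1] != parent_before[1]
print(same_leaf_checksum, parent_changed)
```

In a real pool the same block can be referenced from many parents (snapshots, clones, dedup), so every one of them would have to be found and rewritten, which is where the pain comes from.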

The benefits in terms of defragmentation are also limited because ZFS does a fair job of resisting fragmentation-related performance penalties. The most that I would expect to see on a pool where poor performance is not caused by the best fit allocator would be a factor of two on sequential reads.

As it says in that slide deck's first slide (after the title slide), second bullet, this particular device removal technique is to deal with an "oops" where one accidentally adds a storage vdev to an existing pool.

The zpool command line utility tries hard to help you not shoot yourself in the foot, but "zpool add -f pool diskname" sometimes happens when "zpool add -f pool cache diskname" was meant. Everyone's done it once. Think of a system melting down because the l2arc has died, and you're trying to replace it in a hurry, and you fat-finger the attempt to get rid of the "-n" and end up getting rid of "cache" instead.

Without this device removal, that essentially dooms your pool -- there is no way to back out, and the best you can do is throw hardware at the pool: attach another device fast to mirror the single-device vdev, then try to grow the vdev to something temporarily useful, where "temporarily" almost always means "as long as it takes to get everything properly backed up", with the goal being the destruction and re-creation of the pool (plus restoration from backups).

With this device removal, you do not have to destroy your pool; you have simply leaked a small amount of space (possibly permanently) and will carry a seek penalty on some blocks (possibly permanently, but that's rarer) that get written to that vdev before the replacement.
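One way to picture where that permanent cost comes from, as a rough sketch only: existing block pointers keep their old addresses on the removed vdev, and reads get redirected through a remap table. The `RemovedVdevMap` class and its fields below are hypothetical and do not reflect the real on-disk format. The table has to stay around for as long as anything still points at the old addresses.

```python
import bisect


class RemovedVdevMap:
    """Hypothetical remap table for a removed vdev.

    Holds sorted, non-overlapping ranges of old offsets and where each
    range was copied to. This is an illustration, not the ZFS format.
    """

    def __init__(self):
        self.starts = []   # old start offsets, kept sorted for bisect
        self.entries = []  # (old_start, length, new_vdev, new_start)

    def add(self, old_start, length, new_vdev, new_start):
        i = bisect.bisect_left(self.starts, old_start)
        self.starts.insert(i, old_start)
        self.entries.insert(i, (old_start, length, new_vdev, new_start))

    def resolve(self, old_offset):
        """Translate an offset on the removed vdev to its new location."""
        i = bisect.bisect_right(self.starts, old_offset) - 1
        if i < 0:
            raise KeyError(old_offset)
        old_start, length, new_vdev, new_start = self.entries[i]
        if old_offset >= old_start + length:
            raise KeyError(old_offset)
        return new_vdev, new_start + (old_offset - old_start)


remap = RemovedVdevMap()
remap.add(old_start=0, length=100, new_vdev=0, new_start=5000)
remap.add(old_start=100, length=50, new_vdev=1, new_start=200)

# Every read of an old pointer pays for this extra lookup.
print(remap.resolve(10))   # (0, 5010)
print(remap.resolve(120))  # (1, 220)
```

The more data had landed on the device before removal, the more entries the table needs, which is why this is tolerable for the "oops" case and unpleasant for anything bigger.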

As noted further in the slide deck (and in Alex's blog entries), this only works for single device vdevs -- you cannot remove anything else, like a raidz vdev, and you have to detach devices from mirror vdevs before removal.

Also, note the overheads: although you can remove a single-device vdev with a large amount of data on it, doing so is a wrecking ball to resources, particularly memory. You won't want to do something like:

Before:

mirror-0
  disk0 2tb-used 3tb-disk-size
  disk1 2tb-used 3tb-disk-size
mirror-1
  disk2 2tb-used 3tb-disk-size
  disk3 2tb-used 3tb-disk-size

do an expand dance, so you have

mirror-0
  disk0 2tb-used 6tb-disk-size
  disk1 2tb-used 6tb-disk-size
mirror-1
  disk2 2tb-used 3tb-disk-size
  disk3 2tb-used 3tb-disk-size

then detach disk3, then use device removal to remove disk2, except in extremely special circumstances, and only where you are well aware of the time it will take; of the danger to the unsafe data in the pool during the removal (i.e., everything in former mirror-1); that your pool will be trashed beyond hope in the presence of crashes or errors during the removal; and that you will have a permanent, expensive overhead in the pool after the removal is done.

It would almost certainly be much faster and vastly safer to make a new pool with the 6tb disks and zfs send data from the old one to the new one.

I think we're basically agreeing loudly about everything except whether the example demonstrates mitigating the impact of block pointers (BPs) being immutable while adding a feature that requires that statement to be less than true. And I agree, the permanent overhead of a mini-DDT is a non-starter for anything other than the example case of "oops I added a device, time to evac it before $TONS_OF_DATA gets landed".

Certainly, it would be much less exciting to send|recv from poolA to poolB, and it would require no code changes and no GB per TB of data indirection overhead.
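For a feel of how "GB per TB" can come about, here is some back-of-the-envelope arithmetic in Python. The numbers are assumptions picked for illustration, not measured ZFS figures: `entry_bytes=64` is a made-up per-entry cost, and the two segment sizes are just example values for how finely the copied data ends up mapped.

```python
def mapping_overhead_bytes(data_bytes, avg_segment_bytes, entry_bytes):
    """Estimate indirection overhead: one table entry per mapped segment.

    All three parameters are illustrative assumptions, not ZFS constants.
    """
    entries = -(-data_bytes // avg_segment_bytes)  # ceiling division
    return entries * entry_bytes


TIB = 2**40
GIB = 2**30

# Hypothetical: 1 TiB evacuated, 64 bytes of bookkeeping per segment.
coarse = mapping_overhead_bytes(TIB, 128 * 1024, 64)  # large segments
fine = mapping_overhead_bytes(TIB, 8 * 1024, 64)      # small segments

print(coarse / GIB)  # 0.5
print(fine / GIB)    # 8.0
```

Under those assumptions, the overhead grows linearly with the amount of data moved and inversely with segment size, so a mostly empty "oops" device costs almost nothing while a full one gets expensive fast.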

But this was intended as an example of how many caveats and problems are involved in even a "simple" feature involving shuffling data on-disk, and thus, why "defrag" is a horrendously hard problem in this environment.