by rsync 3695 days ago
"Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently."

Confirmed. Not FUD.

Our experience[1] is that things go to hell around 90%, and even if you bring usage back below 90% there is a permanent performance degradation to the pool. To be safe, we try to keep things below 80%. That's probably a bit conservative, though.

ZFS needs defrag. It is not reasonable to give up 3 drives' worth of capacity for parity (raidz3, for instance) and then on top of that set aside another 10-20% as the "angels' share".
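To put rough numbers on that complaint, here is a minimal sketch of the arithmetic. The 12-drive vdev width is a made-up example, and the function ignores raidz padding and metadata overhead, so treat the result as a ballpark rather than what `zfs list` would report.

```python
def usable_fraction(total_drives: int, parity_drives: int, reserve: float) -> float:
    """Fraction of raw capacity left for data after parity and a free-space reserve.

    Simplified model: ignores raidz padding, metadata, and any internal
    reservation the filesystem makes for itself.
    """
    if not 0 < parity_drives < total_drives:
        raise ValueError("parity_drives must be between 1 and total_drives - 1")
    if not 0.0 <= reserve < 1.0:
        raise ValueError("reserve must be in [0, 1)")
    data_fraction = (total_drives - parity_drives) / total_drives
    return data_fraction * (1.0 - reserve)


# Hypothetical 12-drive raidz3 vdev (3 parity drives), with the 10% and 20%
# free-space reserves mentioned in the comment above.
for reserve in (0.10, 0.20):
    print(f"reserve {reserve:.0%}: usable {usable_fraction(12, 3, reserve):.1%}")
```

With a 20% reserve that hypothetical vdev ends up with about 60% of its raw capacity available for data.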

[1] rsync.net


ZFS has defragmentation built into the very design of it!

It doesn't fragment, it actually turns all random writes into sequential ones, provided there is enough space because ZFS uses copy-on-write atomic writes:

http://constantin.glez.de/blog/2010/04/ten-ways-easily-impro...

http://everycity.co.uk/alasdair/2010/07/zfs-runs-really-slow...

Now, for those of us in the Solaris / illumos / SmartOS world, this is well known and well understood. We either keep 20% free in the pool, or we turn off the defrag search algorithm. But with the Linux crowd missing out on 11 years of experience, I see there will be lots of misunderstanding of what is actually going on, and consequently lots of misinformation, which is unfortunate.

Experienced* SunOS admins are aware of that and can still end up -- accidentally I think -- with ZFS filesystems with unacceptable performance in a state that Oracle apparently didn't understand. There was a ticket open for on the order of months, but I don't know whether it ever got resolved.

* I'm not sure how experienced, but they have Sun hardware running that's older than ZFS.

The performance degradation is likely from full metaslabs and maybe from gang blocks, although ZFS does a fair job at preventing gang blocks by using best fit behavior to minimize the external fragmentation that necessitates them. The magic threshold for best fit behavior is 96% full at the metaslab level. This tends to be where slowdowns occur. On spinning disks, being near full also means that basically all of the outermost tracks have been used, so you are limited to the innermost tracks, which can halve bandwidth.
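As a toy illustration of that first-fit versus best-fit switch (this is not ZFS code, just a sketch of the idea), the function below picks a free segment from a list. The names `pick_segment` and `BEST_FIT_THRESHOLD` and the flat list of segments are assumptions made for the example; only the 96% figure comes from the comment above.

```python
from typing import List, Optional, Tuple

# Fraction of a metaslab in use at which the allocator switches strategy.
# The 96% figure is taken from the comment above; everything else is a toy model.
BEST_FIT_THRESHOLD = 0.96


def pick_segment(
    free_segments: List[Tuple[int, int]], size: int, used_fraction: float
) -> Optional[int]:
    """Return the index of the free segment to allocate from, or None.

    free_segments is a list of (offset, length) pairs sorted by offset.
    Below the threshold we take the first segment that fits (cheap search).
    At or above it we take the smallest segment that fits, which preserves
    large free runs and reduces external fragmentation.
    """
    candidates = [(i, seg) for i, seg in enumerate(free_segments) if seg[1] >= size]
    if not candidates:
        # No single segment is large enough; in ZFS terms this is the case
        # where the write would have to be split up (a gang block).
        return None
    if used_fraction < BEST_FIT_THRESHOLD:
        return candidates[0][0]  # first fit
    return min(candidates, key=lambda c: c[1][1])[0]  # best fit


free = [(0, 64), (100, 8), (200, 16)]
print(pick_segment(free, 8, 0.50))   # first fit: takes from the 64-unit run
print(pick_segment(free, 8, 0.97))   # best fit: takes the exact 8-unit hole
```

The best-fit search has to look at every candidate, which is one simplified way to picture why allocation gets slower once a metaslab crosses the threshold.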

Anyway, it would be nice if you could provide actual numbers and metaslab statistics from zdb. The worst case fragmentation that has been reported and that I can confirm from data provided to me is a factor of 2 reduction in sequential read bandwidth on a pool consisting of spinning disks after it had reached ~90% capacity. All files on it had been created out of sequence by BitTorrent.

A factor of 2 might be horrible to some people. I can certainly imagine a filesystem performing many times worse though. I would be interested to hear from someone who managed to do worse than that in a manner that cannot be ascribed to the best fit allocator protecting the pool from gang block formation.

I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux, but:

1) Having a ZIL helps with this, and in general.

2) ZFS changes strategy depending on how full it is: it spends more time avoiding further fragmentation rather than grabbing the first empty slot. This hit would go away if you get the free space back up.

3) Finally, there is a way[1] to have ZFS keep all the info it needs in RAM, which greatly alleviates the times when it starts hunting harder to prevent more fragmentation. It looks like the RAM requirements are 32GB/1PB... so not too bad IMO.

[1] https://blogs.oracle.com/bonwick/entry/space_maps

"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue once you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.

The magic threshold is 96% per metaslab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch that is in ZoL that disables it by default on vdevs based on solid state disks, where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep the allocations in metaslabs toward the front of the disks pinned. That would make it a pain to get the metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.

That said, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.

You really shouldn't run non-CoW file systems above 90%, to include UFS and ext.

Agreed. I don't think anyone is arguing that you should do it.

What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.

That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.

My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.

Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.
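A rough sketch of what that send/receive cycle could look like, written as a Python helper that only builds the command strings and does not run anything. The dataset names (`tank/data`, `tank2/data`) and snapshot names (`migrate1`, `migrate2`) are hypothetical, and the free-space check is the crude "room for a second full copy" rule from the comment above, not something ZFS computes for you.

```python
from typing import List


def has_room_for_copy(dataset_used_bytes: int, target_free_bytes: int) -> bool:
    """Crude check: the target needs at least as much free space as the
    source dataset currently uses, since two copies exist for a while."""
    return target_free_bytes >= dataset_used_bytes


def send_recv_plan(
    src: str, dst: str, first: str = "migrate1", final: str = "migrate2"
) -> List[str]:
    """Build (but do not run) the shell commands for a full send followed by
    one incremental catch-up send. Snapshot names are placeholders."""
    return [
        # 1. Snapshot the live filesystem and copy it in full while it stays online.
        f"zfs snapshot {src}@{first}",
        f"zfs send {src}@{first} | zfs recv {dst}",
        # 2. After quiescing writers, snapshot again and send only the delta.
        f"zfs snapshot {src}@{final}",
        f"zfs send -i {src}@{first} {src}@{final} | zfs recv {dst}",
    ]


if has_room_for_copy(dataset_used_bytes=4 * 10**12, target_free_bytes=6 * 10**12):
    for command in send_recv_plan("tank/data", "tank2/data"):
        print(command)
```

The only downtime in this sketch is the window between quiescing writers and finishing the incremental send, which is where the "almost no downtime" claim comes from. Switching clients over to the new copy and destroying the old one are left out.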

"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."

Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".

This does work.
On UNIX, there are two defragmentation utilities:

`tar` and `zfs send | zfs recv`.