| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by craigyk 3695 days ago

I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux, but:

1) Having a ZIL helps with this, and in general. 2) ZFS changes strategy depending on how full it is, it spends more time avoiding further fragmentation rather than grabbing the first empty slot. This hit would go away if you get the free space back up. 3) Finally, there is a way[1] to have ZFS keep all the info it needs in RAM to greatly alleviate the times when it starts hunting harder to prevent more fragmentation. It looks like the RAM requirements are 32GB/1PB... so not too bad IMO.

[1] https://blogs.oracle.com/bonwick/entry/space_maps

1 comments

rsync 3695 days ago

"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue one you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tune2fs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.

link

ryao 3694 days ago

The magic threshold is 96% per meta slab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch that is in ZoL that disables it on solid state disk based vdevs by default where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep the allocations in metaslabs toward the front of the disks pinned. That would make it a pain to get the metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.

That said, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.

link

kev009 3695 days ago

You really shouldn't run non-CoW file systems above 90%, to include UFS and ext

link

rsync 3694 days ago

Agreed. I don't think anyone is arguing that you shouldn't do it.
What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.
That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.

link

DanielDent 3694 days ago

My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.
Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.

link

rsync 3694 days ago

"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."
Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".

link

ryao 3694 days ago

You ought to define what is fatal here. The worst that I have seen reported at 90% full is a factor of 2 on sequential reads off mechanical disks, which is acceptable to most people. Around that point, sequential writes should also suffer similarly from writes going to the inner most tracks.

link

DanielDent 3694 days ago

(1) I'm not proposing recreating the pool - I'm proposing an approach to incrementally fixing the pool in an entirely online manner.
(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.

link

ryao 3694 days ago

This does work.

link

Annatar 3694 days ago

On UNIX, there are two defragmentation utilities:
`tar` and `zfs send | zfs recv`.

link