| HN Mirror



	by viraptor 1031 days ago
	Does it actually not update in place even for areas with a single reference? I haven't checked the source, but that sounds like fragmentation hell on spinning disks. That would absolutely kill the performance on zfs-hosted VM images / databases, which I didn't think actually happens... (Apart from the intent log, which sure, that's append only)


rincebrain 1031 days ago

I promise you, it does not.

ZFS really deeply assumes that, when a region is in use, it will not change until it's no longer in use anywhere. It also won't reuse things you just freed for a certain number of txgs afterward, so that you can roll back a couple of txgs in case of dire problems without excitement. (Since having enough writes will cause more txgs to happen faster, people don't in practice run into being unable to use newly freed space.)
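To make the "never overwrite live data, and sit on freed blocks for a few txgs" behaviour concrete, here is a toy Python model. It is only a sketch: the class name, the `defer_txgs=2` default, and the single flat block list are assumptions made for illustration and are not how the real allocator is structured.

```python
from collections import deque


class ToyCowAllocator:
    """Toy copy-on-write allocator (hypothetical, for illustration only).

    A rewrite of a logical block always lands on a new physical block, and
    the old physical block only becomes reusable defer_txgs txgs later.
    """

    def __init__(self, nblocks, defer_txgs=2):
        self.free = deque(range(nblocks))  # physical blocks usable right now
        self.deferred = {}                 # txg -> blocks freed during that txg
        self.defer_txgs = defer_txgs       # assumed value, not read from ZFS
        self.txg = 1
        self.live = {}                     # logical id -> physical block

    def write(self, logical):
        new = self.free.popleft()          # never reuse the block being replaced
        old = self.live.get(logical)
        if old is not None:
            self.deferred.setdefault(self.txg, []).append(old)
        self.live[logical] = new
        return new

    def sync(self):
        """Close the current txg and release blocks whose deferral has expired."""
        self.txg += 1
        ready = self.txg - self.defer_txgs
        for t in [t for t in self.deferred if t <= ready]:
            self.free.extend(self.deferred.pop(t))


alloc = ToyCowAllocator(nblocks=4)
first = alloc.write("vm-image-block-0")
second = alloc.write("vm-image-block-0")   # same logical block, new location
print(first, second, first in alloc.free)
```

Calling `alloc.sync()` twice after that puts `first` back on the free list, which is the "a couple txgs" window in miniature.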

Also in practice, defining what "sequential" means with multiple disks in nontrivial topologies becomes...exciting anyway, and for writes, you only care that things are relatively, not absolutely, sequential for spinning media, and on reads, prefetch is going to notice you doing heavily sequential IO and queue things up anyway. (IMO)

If you like, you could go check on your configurations, what the DVAs for the different data blocks in your VM images are - something like zdb -dbdbdbdbdbdb [dataset] [object id, which you can get from the "inode number" of the file, or if it's a zvol, I think it's always just 1 that all the data you think of as the "disk" goes in...]

You'll almost certainly find that the regions that changed more than a couple txgs apart (the "birth=" value is the logical/physical txg the record was created in) are mostly not remotely sequential.
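If you want to put a number on "not remotely sequential", something like the following Python sketch would do it once you have pulled the DVAs out by hand. It assumes each DVA is written as a hex `vdev:offset:size` triple and that you list them in logical block order; the helper names and the sample values are made up, and it does not parse zdb output itself.

```python
def parse_dva(text):
    """Parse an assumed 'vdev:offset:size' hex triple into three ints."""
    vdev, offset, size = (int(part, 16) for part in text.split(":"))
    return vdev, offset, size


def contiguous_fraction(dvas):
    """Fraction of logically adjacent blocks that are also physically adjacent.

    dvas is a list of (vdev, offset, size) tuples in logical block order.
    """
    if len(dvas) < 2:
        return 1.0
    hits = sum(
        1
        for (v0, off0, size0), (v1, off1, _) in zip(dvas, dvas[1:])
        if v0 == v1 and off0 + size0 == off1
    )
    return hits / (len(dvas) - 1)


# Hypothetical sample: the second block follows the first, the third does not.
sample = [parse_dva(s) for s in ("0:1000:1000", "0:2000:1000", "0:9000:1000")]
print(contiguous_fraction(sample))
```

This only looks at strict adjacency on a single vdev, so it will understate how "relatively sequential" things are on a multi-disk topology.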

(Nit - two exceptions come to mind. First, the uberblocks sit at basically a fixed position on disk relative to the disk's size, with a fixed size, and you get [fixed size]/[minimum allocation size] of them in a ring buffer, basically, before you overwrite the oldest one. That happens by just overwriting it, since it's technically not in use any more; someone just might want to roll back to it in a "This Should Never Happen(tm)" case. Second, there's the newly added feature of corrective send/recv, which lets you feed ZFS a send stream of an "intact" copy of something that had an uncorrectable data error and have it scribble over the mangled copy with the fixed one in-place, assuming it passes the checksums.)
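The [fixed size]/[minimum allocation size] arithmetic for the uberblock ring can be sketched like this in Python. The 128 KiB ring size, the 1 KiB minimum slot, and picking the slot as txg modulo the slot count are assumptions for the sake of the example, so treat the numbers as illustrative rather than authoritative.

```python
RING_BYTES = 128 * 1024   # assumed size of the uberblock ring per label
MIN_SLOT_BYTES = 1024     # assumed smallest uberblock slot


def uberblock_slots(ashift):
    """Number of ring slots for a device whose minimum allocation is 2**ashift bytes."""
    slot_bytes = max(1 << ashift, MIN_SLOT_BYTES)
    return RING_BYTES // slot_bytes


def slot_for_txg(txg, ashift):
    """Simplified slot choice: the ring wraps and the oldest entry gets overwritten."""
    return txg % uberblock_slots(ashift)


for ashift in (9, 12):
    print(ashift, uberblock_slots(ashift))
```

Under these assumptions a 4 KiB-sector device keeps far fewer old uberblocks around than a 512-byte-sector one, which is the practical limit on how far back a rollback could go.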


viraptor 1031 days ago

So looking at various benchmarks, reports and tuning guides, it does look like spinning-disk performance really suffers from zfs fragmentation. I hadn't seen those before, but I also haven't dealt with databases on zfs. Something to keep in mind, I guess.

Edit: after reviewing a few benchmarks, the outcome seems to be - even on SSD, make sure you actually want the zfs features, because ext4 will be a lot faster.


toast0 1030 days ago

Yeah, it's a tradeoff. Zfs gives you easy data integrity verification (and recovery if you have redundancy), easy snapshotting, easy send/recv. But you lose out on modify in place, and unified kernel memory management (at least on FreeBSD and Linux, maybe it's different on Solaris?); both of those can reduce performance, especially in certain use cases.

IMHO, zfs is a clear win for durable storage for documents and personal media. It's not a clear win for ephemeral storage for a messaging service or a CDN. If you don't mind running multiple filesystems, zfs probably makes sense for your OS and application software even if your application data should be on a different filesystem.


rincebrain 1030 days ago

Do you have pointers?

Because there are various mitigations and configurations involved if you're trying to do lots of small random IO for ZFS, and I've not heard people giving the advice of "just don't" in most use cases.


viraptor 1030 days ago

Just search for "zfs ext4 postgresql benchmark" - you'll find many of them using different configurations.
