by ryao 3694 days ago
ZIL is only used on synchronous IO. Moving it to a dedicated SLOG device would have no impact on non-synchronous IO. A SLOG device does help on synchronous IO though.

That said, all file systems degrade in performance as they fill. I do not think there is anything notable about how ZFS degrades. The most that I have heard happen is a factor of 2 sequential read performance decrease on a system where all files were written by bit torrent and the pool had reached 90% full. That used mechanical disks. A factor of 2 in a nightmare scenario is not that terrible.


A log vdev is a log vdev (or a SLOG). ZIL is a badly overloaded term.

Ignoring logbias=throughput, when you have a slog you save on writing intents for small synchronous writes into the ordinary vdevs in the pool. If you do a lot of little synchronous writes, you can save a lot of IOPS writing their intents to the log vdev instead of the other vdevs. Log vdevs are write-only except at import (and at the end phases of scrubs and exports).

Here's the killer thing on an IOPS-constrained pool not dominated by large numbers of small synchronous writes: the reads get in the way of writes. ZFS is so good at aggregating writes that unless you are doing lots of small synchronous random writes, the write IOPS tend to vanish.

Reads are dealt with very well as well, especially if they are either prefetchable or cacheable. Random small reads are what kill ZFS performance.

Unfortunately, systems dominated by lots of rsync or git or other walks of filesystems tend to produce large numbers of essentially random small reads (in particular, for all the ZFS metadata at various layers, to reach the "metadata" one thinks of at the POSIX layer). This is readily seen with Brendan Gregg's various dtrace tools for zfs.

The answer is, firstly, an ARC that is allowed to grow large, and secondly high-IOPS cache vdevs (L2ARC). l2 hit rates tend to be low compared to ARC hits, but every l2 hit is approximately one less seek on the regular vdevs, and seeks are zfs's true performance killers.
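To put rough numbers on "every l2 hit is approximately one less seek", here is a back-of-the-envelope sketch in Python. The function name `disk_reads` and the hit rates are made up for illustration, and the model assumes each read that misses both ARC and L2ARC costs exactly one seek on the ordinary vdevs, which real pools will only approximate.

```python
def disk_reads(total_reads: int, arc_hit_rate: float, l2_hit_rate: float) -> float:
    """Estimate reads that miss both caches and land on the ordinary vdevs.

    arc_hit_rate: fraction of all reads served from ARC (assumed figure).
    l2_hit_rate:  fraction of ARC *misses* served from L2ARC (assumed figure).
    """
    if not (0.0 <= arc_hit_rate <= 1.0 and 0.0 <= l2_hit_rate <= 1.0):
        raise ValueError("hit rates must be between 0 and 1")
    arc_misses = total_reads * (1.0 - arc_hit_rate)
    return arc_misses * (1.0 - l2_hit_rate)


if __name__ == "__main__":
    # Hypothetical workload: 100,000 small random reads, 80% ARC hit rate.
    without_l2 = disk_reads(100_000, 0.80, 0.0)
    with_l2 = disk_reads(100_000, 0.80, 0.25)
    print(f"reads reaching disks without L2ARC: {without_l2:.0f}")
    print(f"reads reaching disks with a 25% l2 hit rate: {with_l2:.0f}")
```

Even a modest l2 hit rate removes a proportional share of the seeks that ARC could not absorb, which is the whole point on a seek-bound pool.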

Persistent L2ARC is amazing, but has been languishing at https://reviews.csiden.org/r/267/

It has several virtues that are quickly obvious in production. Firstly, you get bursts of l2arc hits near import time, and if you have frequently traversed zfs metadata (which is likely if you have containers of some sort running on the pool shortly after import) the performance improvement is obvious. Secondly, you get better data-safety; l2arc corruption, although rare in the real world, can really ruin your day, and the checksumming in persistent l2arc is much more sound. Thirdly, it can take a very long time for large l2arcs to become hot, which makes system downtime (or pool import/export) more traumatic without persistent l2arc (rebuilds of full ~128GiB l2arc vdevs take a couple of seconds or so on all realistic devices). Even USB3 thumb drives (e.g. Patriot Supersonic or Hyper-X DataTravellers, both of which I've used on busy pools) are fast and give an IOPS uptick early on after a reboot or import, and of course you can have several of those on a pool. "Real" ssds give greater IOPS still. Fourthly, the persistent l2arc being available at import time means that early writes are not stuck waiting for zfs metadata to be read in from the ordinary vdevs; that data again is mostly randomly placed LBA-wise, and small, so there will be many seeks compared to the amount of data needed. Persistent l2arc is a huge win here, especially if for some reason you insist on having datasets or zvols that require DDT lookups (small synchronous high-priority reads if not in ARC or L2ARC!) at write time.

Maybe you could consider integrating it into ZoL since you guys have been busy exploring new features lately.

Finally, if you are doing bittorrent or some other system which produces temp files that are scattered somewhat randomly, there are two things you can do which will help: firstly, recordsize=1M (really; it's great for reducing write IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC), and secondly, particularly if your receives take a long time (i.e., many txgs), tell your bittorrent client to move the file to a different dataset when the file has been fully received and checked -- that will almost certainly coalesce scattered records.
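As a rough illustration of why recordsize=1M helps, here is a small Python sketch comparing record counts for the same file at the default 128K recordsize and at 1M. The helper name `record_count` and the 4 GiB file size are made up for the example, and it assumes the worst case where every record ends up at its own location on disk, so each record costs about one seek to read back.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def record_count(file_bytes: int, recordsize: int) -> int:
    """Number of records needed to hold a file, rounding up the last one."""
    if file_bytes < 0 or recordsize <= 0:
        raise ValueError("file_bytes must be >= 0 and recordsize > 0")
    return (file_bytes + recordsize - 1) // recordsize


if __name__ == "__main__":
    file_bytes = 4 * GIB  # hypothetical torrent payload
    small = record_count(file_bytes, 128 * KIB)  # default recordsize
    large = record_count(file_bytes, 1 * MIB)    # recordsize=1M
    print(f"records at 128K: {small}")
    print(f"records at 1M:   {large}")
    print(f"worst-case seek reduction: {small // large}x")
```

Fewer records also means fewer block pointers to keep around, which is where the reduced metadata pressure in ARC comes from.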

The term ZIL is not overloaded. Unfortunately, users tend to misuse it because the ZIL's existence is hard to discover until it is moved into a SLOG device.

As for persistent L2ARC, it was developed for Illumos and will be ported after Illumos adopts a final version of it.