by hhw 3695 days ago
Using dedicated ZIL significantly reduces fragmentation:

http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-s...

Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

4 comments

ZIL is only used on synchronous IO. Moving it to a dedicated SLOG device would have no impact on non-synchronous IO. A SLOG device does help on synchronous IO though.
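For anyone unsure which writes count as synchronous: from the application side, these are the writes that must reach stable storage before the call returns, typically requested with fsync() or O_SYNC. A minimal Python sketch of the difference follows; the function names and file names are made up for illustration, and nothing here is ZFS-specific.

```python
import os
import tempfile


def write_async(path: str, data: bytes) -> None:
    """Buffered write: returns once the data has been handed to the OS."""
    with open(path, "wb") as f:
        f.write(data)


def write_sync(path: str, data: bytes) -> None:
    """Synchronous write: fsync() blocks until the file system reports
    the data as durable. On ZFS this is the kind of write that goes
    through the ZIL (and a SLOG device, if the pool has one)."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer down to the kernel
        os.fsync(f.fileno())  # ask the kernel to make it durable


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_async(os.path.join(d, "async.bin"), b"x" * 4096)
        write_sync(os.path.join(d, "sync.bin"), b"x" * 4096)
```

A workload made up only of the first kind of write would not be expected to benefit from a SLOG.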

That said, all file systems degrade in performance as they fill. I do not think there is anything notable about how ZFS degrades. The worst that I have heard of is a factor-of-2 decrease in sequential read performance on a system where all files were written by BitTorrent and the pool had reached 90% full. That used mechanical disks. A factor of 2 in a nightmare scenario is not that terrible.

A log vdev is a log vdev (or a SLOG). ZIL is a badly overloaded term.

Ignoring logbias=throughput, when you have a slog you save on writing intents for small synchronous writes into the ordinary vdevs in the pool. If you do a lot of little synchronous writes, you can save a lot of IOPS writing their intents to the log vdev instead of the other vdevs. Log vdevs are write-only except at import (and at the end phases of scrubs and exports).

Here's the killer thing on an IOPS-constrained pool not dominated by large numbers of small synchronous writes: the reads get in the way of writes. ZFS is so good at aggregating writes that unless you are doing lots of small synchronous random writes, the write IOPS tend to vanish.

Reads are dealt with very well as well, especially if they are either prefetchable or cacheable. Random small reads are what kill ZFS performance.

Unfortunately, systems dominated by lots of rsync or git or other walks of filesystems tend to produce large numbers of essentially random small reads (in particular, for all the ZFS metadata at various layers, to reach the "metadata" one thinks of at the POSIX layer). This is readily seen with Brendan Gregg's various dtrace tools for zfs.

The answer is, firstly, an ARC that is allowed to grow large, and secondly high-IOPS cache vdevs (L2ARC). l2 hit rates tend to be low compared to ARC hits, but every l2 hit is approximately one less seek on the regular vdevs, and seeks are zfs's true performance killers.
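To put rough numbers on "every l2 hit is approximately one less seek", here is a back-of-the-envelope sketch in Python. The hit rates and request count are invented for illustration, not measurements, and the model assumes one disk read per miss.

```python
def disk_reads(requests: int, arc_hit: float, l2_hit: float) -> float:
    """Reads that fall through to the ordinary vdevs.

    arc_hit: fraction of all requests served from ARC (RAM).
    l2_hit:  fraction of the ARC misses served from L2ARC.
    """
    arc_misses = requests * (1.0 - arc_hit)
    return arc_misses * (1.0 - l2_hit)


# Hypothetical workload: 100,000 small reads, 90% ARC hit rate.
without_l2 = disk_reads(100_000, arc_hit=0.90, l2_hit=0.0)
with_l2 = disk_reads(100_000, arc_hit=0.90, l2_hit=0.25)

# Even a modest 25% L2ARC hit rate removes about a quarter of the
# reads (and so roughly that many seeks) from the regular vdevs.
seeks_saved = without_l2 - with_l2
```

Under these assumed numbers, roughly 2,500 of the 10,000 ARC misses never touch the mechanical disks.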

Persistent L2ARC is amazing, but has been languishing at https://reviews.csiden.org/r/267/

It has several virtues that are quickly obvious in production. Firstly, you get bursts of l2arc hits near import time, and if you have frequently traversed zfs metadata (which is likely if you have containers of some sort running on the pool shortly after import) the performance improvement is obvious. Secondly, you get better data-safety; l2arc corruption, although rare in the real world, can really ruin your day, and the checksumming in persistent l2arc is much more sound. Thirdly, it can take a very long time for large l2arcs to become hot, which makes system downtime (or pool import/export) more traumatic than with persistent l2arc. Rebuilds of full ~128GiB l2arc vdevs take a couple of seconds or so on all realistic devices; even USB3 thumb drives (e.g. Patriot Supersonic or Hyper-X DataTravellers, both of which I've used on busy pools) are fast and give an IOPS uptick early on after a reboot or import, and of course you can have several of those on a pool. "Real" ssds give greater IOPS still. Fourthly, the persistent l2arc being available at import time means that early writes are not stuck waiting for zfs metadata to be read in from the ordinary vdevs; that data again is mostly randomly placed LBA-wise, and small, so there will be many seeks compared to the amount of data needed. Persistent l2arc is a huge win here, especially if for some reason you insist on having datasets or zvols that require DDT lookups (small synchronous high-priority reads if not in ARC or L2ARC!) at write time.

Maybe you could consider integrating it into ZoL since you guys have been busy exploring new features lately.

Finally, if you are doing bittorrent or some other system which produces temp files that are scattered somewhat randomly, there are two things you can do which will help: firstly, recordsize=1M (really; it's great for reducing write IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC), and secondly, particularly if your receives take a long time (i.e., many txgs), tell your bittorrent client to move the file to a different dataset when the file has been fully received and checked -- that will almost certainly coalesce scattered records.
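Both suggestions can be sketched in a few lines of Python. The first function is simple arithmetic on record counts, comparing the 128 KiB default recordsize with 1M; it ignores compression and the way ZFS stores files smaller than one record. The second is a hypothetical helper (the name and paths are made up) for the "move when complete" step; it relies on the fact that each dataset is a separate mounted file system, so a move across datasets becomes a copy plus delete, which rewrites the data.

```python
import os
import shutil

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def records_needed(file_size: int, recordsize: int) -> int:
    """Number of records for a file, rounded up (ceiling division)."""
    return -(-file_size // recordsize)


def finish_download(tmp_path: str, final_dir: str) -> str:
    """Move a fully received file into final_dir and return its new path.

    If final_dir is on a different dataset, shutil.move() cannot simply
    rename the file, so it copies the contents and removes the original.
    That fresh sequential write is what gives ZFS the chance to lay the
    records out together.
    """
    os.makedirs(final_dir, exist_ok=True)
    return shutil.move(tmp_path, final_dir)


# A 4 GiB file: 32768 records at 128 KiB versus 4096 at 1 MiB.
default_records = records_needed(4 * GIB, 128 * KIB)
large_records = records_needed(4 * GIB, 1 * MIB)
```

Fewer records per file also means fewer block pointers to keep in ARC, which is the metadata pressure mentioned above.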

The term ZIL is not overloaded. Unfortunately, users tend to misuse it because the ZIL's existence is hard to discover until it is moved into a SLOG device.

As for persistent L2ARC, it was developed for Illumos and will be ported after Illumos adopts a final version of it.

I'm using a somewhat older version of ZFS, but I tried having an SLOG (a dedicated ZIL disk) and it went essentially unused, so instead I moved the disk over to a second L2ARC, which helped a lot, as it doubled the throughput.

Further research showed that the ZIL is only needed for synchronous writes, which my workload didn't have any of.
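For reference, repurposing a log device as a cache device comes down to two zpool commands. The Python sketch below only builds and prints them; the pool name "tank" and device name "sdx" are placeholders, and removing a log vdev assumes a pool version recent enough to support it.

```python
import shlex


def remove_log_cmd(pool: str, device: str) -> list:
    """Command to detach a log (SLOG) device from the pool."""
    return ["zpool", "remove", pool, device]


def add_cache_cmd(pool: str, device: str) -> list:
    """Command to add the same device back as a cache (L2ARC) vdev."""
    return ["zpool", "add", pool, "cache", device]


if __name__ == "__main__":
    # Placeholder names; substitute your own pool and device.
    for cmd in (remove_log_cmd("tank", "sdx"), add_cache_cmd("tank", "sdx")):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing rather than executing keeps the sketch harmless; run the commands by hand once you have checked the device name.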

When I looked at ZoL, the ARC was separate from the Linux page cache, and thus you get double buffering.
Only with mmap'ed files.
Why can't ZoL just not cache into ARC when mmaping then?
There is no reason why the driver cannot be patched to mmap into ARC. There are just many higher priority things to do at the moment. In terms of performance, the value of eliminating double caching of mmap'ed data is rather small compared to other things in development. Later this year, ZoL will replace kernel virtual memory backed SLAB buffers with lists of pages (the ABD patches). That will improve performance under memory pressure by making memory reclaim faster and more effective versus the current code that will excessively evict due to SLAB fragmentation. It should also bypass the crippled kernel virtual memory allocator on 32-bit Linux that prevents ZoL from operating reliably there. Additionally, workloads that cause the kernel to frequently count all of the kernel virtual memory allocations would improve tremendously.

Mmap'ing into ARC would probably come after that as it would make mapping easier.
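To make the distinction concrete, here is a small Python sketch of the two access paths being discussed. The helper names are made up; per the replies above, only the mmap path is the one that ends up cached twice on ZoL.

```python
import mmap


def read_via_read(path: str) -> bytes:
    """Ordinary read(): the file system serves the data from its own cache."""
    with open(path, "rb") as f:
        return f.read()


def read_via_mmap(path: str) -> bytes:
    """mmap(): the kernel maps page-cache pages into the process.

    On ZoL, per the discussion above, those pages exist alongside the
    copy held in ARC, which is the double buffering in question.
    Note that mmap raises ValueError for an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]
```

Both functions return the same bytes; the difference is only in which caches the kernel populates along the way.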

ARC yes, ZIL no.
In a thread that is about the perils of ZFS fragmentation, you are replying to a link saying that a ZIL seriously reduces the risk of fragmentation, and saying that someone worried about fragmentation does not need to use a ZIL.

Why? If there's a legitimate reason, please expand.

I think he meant that they might not have one.

It's been a while since I looked at using ZFS for anything meaningful, but at the time (~6 years ago), while losing L2ARC was no big deal, losing dedicated ZIL was catastrophic. I think that's still true today.

So you need at least two ZIL devices in a mirror. On top of that, you really need something faster and lower latency for your ZIL vs. the ARC or main pool; people were trying to use SSDs but most commonly-available drives at the time would either degrade or fail in a hurry under load. So the options were RAM-based, e.g. STEC ZeusRAM on the high end, or some sort of PCI-X/PCIe RAM device. The former was not easy or cheap to acquire for testing stuff, and the latter made failover configs impossible.

I think that ZIL is also not soaking up all writes, just most writes meeting certain criteria. Some just stream through to the pool. So I was always thinking of it as a protection device that also converted random writes to sequential. Some people don't think they need that.

I remember the fragmentation issue being a problem at the time, but also thinking it was probably going to get solved soon because there was so much interest and a whole company behind it. Then Oracle happened. My guess is that if it were still Sun and all the key people were still there, this would be a solved problem right now. As it is, Oracle probably wants you to buy all the extra storage anyway, and would love to offer professional services to get you out of the fragmentation bind you're in.

A lot has changed. Well - one thing actually: you no longer lose your ZFS pool if your dedicated ZIL log (called a SLOG) dies.

Here is some info on ZIL vs SLOG: http://www.freenas.org/blog/zfs-zil-and-slog-demystified/

Your information is out of date. Losing a SLOG device while the system is running is fine. As far as I know, it has always been fine (unless someone goofed on the initial implementation long before I became involved). All data in ZIL is kept in memory, regardless of whether it is written to the main pool or to a SLOG device. The data is written to the main pool in a permanent fashion with the transaction group commit. If a SLOG device dies, that write out still happens and the pool harmlessly stops using it. If the SLOG device dies on an exported pool, you need to set the zil_replay_disable kernel module parameter to allow the pool to be imported. The same might be true if you reboot (although I doubt it, but need to check).
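The reasoning can be illustrated with a toy model in Python. This is not how ZFS is implemented; the class and method names are invented, and it captures only the point made above: synchronous writes stay in memory until the transaction group commit, so losing the SLOG on a running system loses nothing.

```python
class ToyPool:
    """Toy model of a pool with an optional SLOG device (not real ZFS)."""

    def __init__(self) -> None:
        self.main = []        # records committed to the main pool
        self.in_memory = []   # records waiting for the next txg commit
        self.slog = []        # intents logged to the SLOG; None once dead

    def sync_write(self, record: str) -> None:
        # The record is always kept in RAM; the intent is also logged.
        self.in_memory.append(record)
        if self.slog is not None:
            self.slog.append(record)

    def slog_dies(self) -> None:
        # The pool simply stops using the device.
        self.slog = None

    def txg_commit(self) -> None:
        # The permanent write comes from memory, not from the log.
        self.main.extend(self.in_memory)
        self.in_memory.clear()
        if self.slog is not None:
            self.slog.clear()


pool = ToyPool()
pool.sync_write("a")
pool.sync_write("b")
pool.slog_dies()       # SLOG fails while the system is running
pool.sync_write("c")
pool.txg_commit()      # all three records still reach the main pool
```

In this model the log is only ever read back after a crash, which is why its loss matters at import time rather than during normal operation.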

You can test these things for yourself.

> Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

I contend that most people using ZFS in a serious capacity do not have a dedicated ZIL.