| HN Mirror



	by ChrisInEdmonton 4276 days ago
	I have heard that ZFS (at least, ZFS on Linux) requires about 1 GB of memory for every 1 TB of storage. Is this an accurate statement? That's certainly a critical flaw for many use cases, though completely irrelevant for many others. If accurate, what's behind this requirement?

4 comments

ryao 4276 days ago

On ZFS, all file data is stored in B-trees whose leaves each hold X bytes, where X is the value of recordsize at file creation. When a write is done to a file in a dataset with dedup=on, a lookup is done on the data deduplication table (DDT). If the lookup finds an entry, it increments a reference count. If it does not find an entry, it creates one. This involves 3 random seeks, and consequently your total write throughput is [average record size * IOPS / 3 random seeks]. If you have enough memory for the entire DDT to stay cached, this limit is avoided.
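To make the bracketed throughput formula concrete, here is a minimal sketch. The function name and the example disk (100 random IOPS, 128 KiB records) are my own assumptions for illustration, not measurements.

```python
def dedup_write_throughput(avg_record_bytes, iops, seeks_per_write=3):
    """Rough write throughput limit in bytes/s when the DDT is not cached.

    Implements [average record size * IOPS / 3 random seeks] from the comment.
    """
    return avg_record_bytes * iops / seeks_per_write


# Hypothetical spinning disk: 100 random IOPS, 128 KiB average record size.
limit = dedup_write_throughput(128 * 1024, 100)
print(f"{limit / 2**20:.2f} MiB/s")  # about 4.17 MiB/s
```

Smaller records lower the limit proportionally, which is why an uncached DDT hurts most on datasets with many small records.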

The figure of 1GB of RAM for every 1TB of storage is a rule of thumb for avoiding this limit that someone came up with a very long time ago. Unfortunately, the rule is wrong, because memory requirements cannot be estimated that simply, but it has stuck with us for years.

The amount of memory needed to store a DDT is a function of the average record size. If your dataset has the default recordsize=128K (e.g. you are storing many >1MB files), then you can multiply your system memory by 153.6 to determine the total amount of unique data that you can store before hitting the limit I described. If you are storing many small <128KB files, then you need to calculate the average file size and then calculate [average file size * system memory * 12 / 10355] to obtain the total amount of unique data. This assumes the module settings for zfs_arc_max and zfs_arc_meta_max were left at their ZoL defaults. It is important to keep in mind that unique data is different from total data, especially in the case of duplicate files. The estimate also excludes the metadata required for maintaining indirect block trees, dnodes (i.e. inodes), directories and the DDT itself.
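A small sketch of the two estimates, taken literally from the formulas in the comment. The function names are hypothetical, and I am assuming the average file size is given in bytes (the comment does not state the unit; with bytes, the two formulas roughly agree at 128 KiB). The 16 GiB machine is just an example.

```python
GIB = 2**30
TIB = 2**40


def unique_data_large_records(system_memory_bytes):
    """Unique data before hitting the limit with recordsize=128K (files > 1MB)."""
    return system_memory_bytes * 153.6


def unique_data_small_files(avg_file_bytes, system_memory_bytes):
    """Unique data before hitting the limit when files are smaller than 128KB.

    Assumes avg_file_bytes is in bytes (not stated in the original comment).
    """
    return avg_file_bytes * system_memory_bytes * 12 / 10355


mem = 16 * GIB  # hypothetical system with 16 GiB of RAM, ZoL default ARC settings

print(f"{unique_data_large_records(mem) / TIB:.2f} TiB")          # 2.40 TiB
print(f"{unique_data_small_files(16 * 1024, mem) / GIB:.1f} GiB")  # 303.8 GiB
```

Both results are unique data only, and both exclude the metadata overhead mentioned above.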

The total amount of data that you can store on a pool before hitting the limit is [unique data * deduplication multiplier], where unique data is calculated as described above and the deduplication multiplier is a number that is at least 1. A measure of the deduplication multiplier is provided by `zpool list` as DEDUP, so you should be able to see that yourself. If your pool has data that was written without dedup=on, then any duplicates in that data will be counted as unique data for the purposes of that calculation. To provide a simple example of the deduplication multiplier, imagine a pool with only two files that store the same data. The deduplication multiplier for that pool would be 2, provided that both were written with dedup=on set on their dataset. If one or both were written with dedup=off, then the deduplication multiplier would be 1. You can use zdb to calculate the theoretical deduplication statistics for an entire pool by running `zdb -D $POOLNAME`. Note that this will require significant memory because it constructs a full DDT in userland memory.
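The two-file example can be sketched with a toy model. This is not how ZFS computes DEDUP internally; it is only an illustration that assumes each write is a (checksum, dedup_on) pair and that only writes made with dedup=on get a DDT entry. All names here are hypothetical.

```python
from collections import Counter


def dedup_multiplier(writes):
    """Toy multiplier: total references in the DDT divided by unique entries.

    writes is an iterable of (block_checksum, dedup_on) pairs, one per record.
    Records written with dedup=off never enter the table in this model.
    """
    ddt = Counter(checksum for checksum, dedup_on in writes if dedup_on)
    if not ddt:
        return 1.0  # nothing deduplicated, so the multiplier is 1
    return sum(ddt.values()) / len(ddt)


def total_data_before_limit(unique_bytes, multiplier):
    """[unique data * deduplication multiplier] from the comment."""
    return unique_bytes * multiplier


# Two files with identical contents, one record each.
print(dedup_multiplier([("abc", True), ("abc", True)]))    # 2.0
print(dedup_multiplier([("abc", True), ("abc", False)]))   # 1.0
print(dedup_multiplier([("abc", False), ("abc", False)]))  # 1.0
```

On a real pool, read the multiplier from the DEDUP column of `zpool list` rather than modelling it.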