by linuxready 3778 days ago
In this kind of scenario, I expect the block size to have only a marginal impact. Indeed, if all the CD ISOs are very similar, I would expect the average size of a duplicated chunk to be quite big. The difference between using 128k and 64k for BTRFS, for instance, is not very big.

But apart from the block size, I don't see any other explanation for the differences.

Dedup is dedup, so I fail to understand why different implementations should produce such different results in the end (unless an implementation is badly broken).


The files on CDs are aligned to 2 kB boundaries. Dedup looks for identical contiguous blocks of n kB. If the block size of the material you want to dedup does not match the block size of the dedup system, you'll get suboptimal results. The bigger the difference, the worse the results.

Say you have this data:

  ABCABCBACCBBABCC
A dedup system with a block size of 1 can see that you really have just three unique blocks: A, B and C.

Same data, but dedup with block size of 2:

  AB CA BC BA CC BB AB CC
A dedup system with a block size of 2 thinks you have 6 unique blocks: AB, CA, BC, BA, CC and BB.

Etc.
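To make the toy example concrete, here is a minimal sketch in Python that counts unique fixed-size blocks in the same string. The helper name unique_blocks is made up for illustration; real dedup systems hash blocks rather than comparing them directly.

```python
def unique_blocks(data, block_size):
    """Split data into fixed-size blocks and return the distinct ones, sorted."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return sorted(set(blocks))

data = "ABCABCBACCBBABCC"
print(unique_blocks(data, 1))  # ['A', 'B', 'C']
print(unique_blocks(data, 2))  # ['AB', 'BA', 'BB', 'BC', 'CA', 'CC']
```

With block size 1, the 16 input blocks collapse to 3 stored blocks; with block size 2, the 8 input blocks collapse only to 6.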

I'm sorry, I'm not sure I get it. Let's say you have a 1000 kB file which is duplicated and which is located on contiguous blocks (so if the CDs use 2 kB boundaries, we'll have 500 contiguous blocks). If ZFS uses a 128 kB block size, it will detect 7 blocks (896 kB) that it can deduplicate. So we only lose about 10%.

Perhaps there is a high degree of fragmentation, then, and files are not on contiguous blocks?

(This example would be the same if, instead of 2 exactly duplicated files, we had a big common chunk between 2 files.)

Wrong. If the alignment is wrong, you'll likely lose 100%. A 2 kB-aligned file can start at 64 different positions within a 128 kB block.

Those 1000 kB of contiguous 2 kB blocks must start at exactly the same offset mod 128 kB in both copies. There are 64 different possible alignments.
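A rough way to check this is to simulate it. The sketch below assumes a plain fixed-block dedup that hashes every 128 kB block with SHA-256; make_image and shared_blocks are made-up helpers, and the "images" are just random filler with the same 1000 kB payload dropped in at a chosen 2 kB-aligned offset. It is not how ZFS or BTRFS are actually implemented, only an illustration of the alignment effect.

```python
import hashlib
import os

KB = 1024
BLOCK = 128 * KB   # dedup block size
SECTOR = 2 * KB    # CD sector size

def block_hashes(data, block_size=BLOCK):
    """Hash each fixed-size block of data, starting from offset 0."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def shared_blocks(image_a, image_b):
    """Count blocks of image_b whose content also appears as a block of image_a."""
    seen = set(block_hashes(image_a))
    return sum(1 for h in block_hashes(image_b) if h in seen)

def make_image(payload, offset, total=2048 * KB):
    """Hypothetical image: random filler with payload placed at a sector-aligned offset."""
    assert offset % SECTOR == 0
    filler = os.urandom(total)
    return filler[:offset] + payload + filler[offset + len(payload):]

payload = os.urandom(1000 * KB)          # the duplicated 1000 kB file
a = make_image(payload, 128 * KB)
same = make_image(payload, 256 * KB)     # same offset mod 128 kB
shifted = make_image(payload, 258 * KB)  # one 2 kB sector off

print(shared_blocks(a, same))     # 7 (896 kB deduplicated)
print(shared_blocks(a, shifted))  # 0 (nothing deduplicated)
```

When the offsets agree mod 128 kB you get the 7 blocks from the estimate above; shift one copy by a single 2 kB sector and no block matches at all.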

Oh, that's it then! Thanks for the clarification.