Hacker News new | ask | show | jobs
by Aachen 592 days ago
That being:

> As we’ve seen from the last 7000+ words, the overheads are not trivial. Even with all these changes, you still need to have a lot of deduplicated blocks to offset the weight of all the unique entries in your dedup table. [...] what might surprise you is how rare it is to find blocks eligible for deduplication are on most general purpose workloads.

> But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”). [...] it’s actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don’t know that came from a copy operation. [...] [This isn't] saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.

> [Dedup is only useful if] you have a very very specific workload where data is heavily duplicated and clients can’t or won’t give direct “copy me!” signal

The section labeled "summary" imo doesn't do the article justice by being fairly vague. I hope these quotes from near the end of the article give a more concrete idea of why (not) use it

1 comments

> offset the weight of all the unique entries in your dedup table

Didn't read the 7000 words... But isn't the dedup table in the form of a bunch of bloom filters so the whole dedup table can be stored with ~1 bit per block?

When you know there is likely a duplicate, you can create a table of blocks where there is a likely duplicate, and find all the duplicates in a single scan later.

That saves having massive amounts of accounting overhead storing any per-block metadata.