
by mixu 3707 days ago

I wrote one of these for fun a while back using the following approach:

- Files are indexed by inode and device, files with the same inode + device are considered equal. (My main use case for this was to bundle up JS files.)

- Files are then indexed by size; only files with the same size are compared.

- During comparison, the files are read at block sizes increasing in powers of two, starting with 2k. The blocks are hashed and compared, and if they do not match the comparison is stopped early (often without having to read the full file). If all the hashes are equal, then the files are considered to be equal.

- Hashes are only computed when needed and cached in memory. Since the hash block size increases in powers of two, only a few dozen hashes are needed even for large files (reducing memory usage compared to a fixed hash block size).
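The four steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in the linked repository: the names `LazyBlockHasher`, `same_content` and `find_duplicates`, the choice of SHA-256, and keeping only one path per inode + device pair are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict

FIRST_BLOCK = 2048  # 2k starting block, as described above


class LazyBlockHasher:
    """Hashes a file in blocks of 2k, 4k, 8k, ... and caches each digest."""

    def __init__(self, path):
        self.path = path
        self.size = os.path.getsize(path)
        self.digests = []  # digests[i] covers the i-th block

    def digest(self, index):
        """Return the digest of block `index`, or None past the end of the file."""
        while len(self.digests) <= index:
            i = len(self.digests)
            # Blocks 0..i-1 cover FIRST_BLOCK * (2**i - 1) bytes in total.
            offset = FIRST_BLOCK * (2 ** i - 1)
            if offset >= self.size:
                return None
            with open(self.path, "rb") as f:
                f.seek(offset)
                # Simplification: the whole block is read into memory at once.
                data = f.read(FIRST_BLOCK * 2 ** i)
            self.digests.append(hashlib.sha256(data).digest())
        return self.digests[index]


def same_content(a, b):
    """Compare two equal-sized files block by block, stopping at the first mismatch."""
    index = 0
    while True:
        da, db = a.digest(index), b.digest(index)
        if da != db:
            return False
        if da is None:  # both files exhausted and every block matched
            return True
        index += 1


def find_duplicates(paths):
    """Return lists of paths whose block hashes all match."""
    # Step 1: keep one path per (device, inode); hard links are equal by definition.
    by_identity = {}
    for p in paths:
        st = os.stat(p)
        by_identity.setdefault((st.st_dev, st.st_ino), p)
    # Step 2: only files of the same size can be equal.
    by_size = defaultdict(list)
    for p in by_identity.values():
        by_size[os.path.getsize(p)].append(LazyBlockHasher(p))
    # Step 3: within each size bucket, cluster files that compare equal.
    groups = []
    for hashers in by_size.values():
        clusters = []
        for h in hashers:
            for cluster in clusters:
                if same_content(cluster[0], h):
                    cluster.append(h)
                    break
            else:
                clusters.append([h])
        groups.extend([h.path for h in c] for c in clusters if len(c) > 1)
    return groups
```

Because block sizes double, a file of n bytes needs only about log2(n / 2048) cached digests, which is where the "few dozen hashes" figure comes from.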

link: https://github.com/mixu/file-dedupe

2 comments

_swa8 3707 days ago

I wrote mine along similar lines, except without using hashing at all. Files of identical size are compared byte-by-byte instead, until first difference or end of file. As many files as possible at a time, of course, to avoid having to read through files multiple times. This avoids any uncertainty about hash collisions.
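A rough Python sketch of that hash-free approach, reading all candidate files in lockstep. The function name `split_by_content` and the 64 KiB chunk size are assumptions for illustration, not details of the tool described above. Python's dict hashes the chunk internally for lookup, but keys are still compared by their actual bytes, so two files only stay grouped when their bytes are equal.

```python
from collections import defaultdict

CHUNK = 64 * 1024  # hypothetical read size


def split_by_content(paths):
    """Partition same-sized files into groups with identical bytes, without file hashes."""
    files = [open(p, "rb") for p in paths]
    try:
        groups = [files]  # every file starts in one candidate group
        finished = []
        while groups:
            next_groups = []
            for group in groups:
                # Read the next chunk of each file and regroup by the raw bytes.
                by_chunk = defaultdict(list)
                for f in group:
                    by_chunk[f.read(CHUNK)].append(f)
                for chunk, members in by_chunk.items():
                    if len(members) < 2:
                        continue  # unique so far, no need to read further
                    if chunk == b"":
                        # These files reached end of file together: identical.
                        finished.append([f.name for f in members])
                    else:
                        next_groups.append(members)
            groups = next_groups
        return finished
    finally:
        for f in files:
            f.close()
```

Each file is read at most once, and a file drops out as soon as it differs from every other file in its group. The sketch keeps every candidate open at the same time, so a large size bucket could hit the process's open-file limit.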

To find out how many files of each size you have:

find ~ -type f -printf '%s\n' | sort | uniq -c | sort -n
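`-printf` is a GNU find extension, so the command above may not work with other find implementations. A portable sketch of the same count in Python (the function name `size_histogram` is made up for this example):

```python
import os
import stat
from collections import Counter


def size_histogram(root):
    """Count regular files under `root` by size in bytes."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable
            if stat.S_ISREG(st.st_mode):  # like -type f: skip symlinks, sockets, ...
                counts[st.st_size] += 1
    return counts


def print_histogram(root):
    """Print 'count size' lines, least common sizes first, like the pipeline above."""
    for size, n in sorted(size_histogram(root).items(), key=lambda kv: kv[1]):
        print(n, size)
```

Calling `print_histogram(os.path.expanduser("~"))` covers the same directory as the shell command. Only sizes with a count above 1 need any content comparison.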


hvidgaard 3707 days ago

A hash is often used for an online algorithm. If a file's hash matches one you have already seen, you know there is potential for dedupe, and you can then do a byte-for-byte comparison to confirm. I suppose you could use size as a prefilter for dedupe. This applies if you dedupe at the file level. Dedupe at the block level doesn't care about the file as a whole, only the blocks, and it's not unusual to see, for instance, mp3 files with the same mp3 stream but different metadata. You cannot do the latter without hashing.
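A minimal sketch of the file-level, online variant: an index of hashes seen so far, with a byte-for-byte check before two files are declared duplicates. The class name `DedupeIndex`, the helper `sha256_of`, and the use of SHA-256 are assumptions for illustration. `filecmp.cmp` with `shallow=False` is the standard-library call that compares file contents.

```python
import filecmp
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


class DedupeIndex:
    """Remembers files seen so far, keyed by content hash."""

    def __init__(self):
        self.by_hash = {}  # digest -> list of distinct stored paths

    def add(self, path):
        """Return an already-stored identical file, or None if `path` is new."""
        digest = sha256_of(path)
        candidates = self.by_hash.setdefault(digest, [])
        for candidate in candidates:
            # A matching hash only signals potential; confirm byte for byte.
            if filecmp.cmp(candidate, path, shallow=False):
                return candidate
        candidates.append(path)
        return None
```

The index only has to keep one digest per stored file, which is what makes the approach usable when files arrive one at a time.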


pjc50 3707 days ago

The "compare by front first" method sounds quite like a prefix tree.
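To make the analogy concrete, here is a small sketch that stores files in a trie keyed by successive block digests, so files sharing a prefix of blocks share a path from the root. It is an eager toy version working on in-memory bytes, with invented names (`block_digests`, `insert`), and is not how either tool above is described as working.

```python
import hashlib

FIRST_BLOCK = 2048  # same 2k starting block as the scheme discussed above


def block_digests(data):
    """Yield digests of blocks of 2k, 4k, 8k, ... covering `data`."""
    offset, size = 0, FIRST_BLOCK
    while offset < len(data):
        yield hashlib.sha256(data[offset:offset + size]).digest()
        offset += size
        size *= 2


def insert(trie, name, data):
    """Walk the trie along the block digests and record `name` at the final node.

    Returns the list of names stored at that node, i.e. the files whose
    block digests all match those of `data`.
    """
    node = trie
    for d in block_digests(data):
        node = node.setdefault("children", {}).setdefault(d, {})
    node.setdefault("names", []).append(name)
    return node["names"]
```

Starting from `trie = {}`, inserting two files with the same blocks lands both names on the same node, while a file that differs in an early block branches off near the root. A lazy version would only compute the next digest while more than one file remains on the current path.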
