by borland
485 days ago
In order to check if a file is a duplicate of another, you need to check it against _every other possible file_. You need some kind of "lookup key". If we took the first 1024 bytes of each file as the lookup key, then our key size would be 1024 bytes. If you have 1 million files on your disk, then that's about 1GB of RAM just to store all the keys. That's not a big deal these days, but it's also annoying if you have a bunch of files that all start with the same 1024 bytes, e.g. perhaps all the Photoshop documents start with the same header. You'd need a 2-stage comparison, where you first match the key (1024 bytes) and then do a full comparison to see if it really matches. It's far more efficient, and less work, if you just use a SHA256 of the file's contents. That gets you a much smaller 32-byte key, and you don't need to bother with 2-stage comparisons.
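For illustration, here is a minimal sketch of the SHA256-as-lookup-key idea in Python. The function names (`sha256_of_file`, `find_duplicates_by_hash`) and the 1 MiB read size are my own choices for the example, not taken from any particular tool.

```python
import hashlib
from collections import defaultdict


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the 32-byte SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def find_duplicates_by_hash(paths):
    """Group paths by content hash; return only groups with 2+ members."""
    groups = defaultdict(list)
    for path in paths:
        # The 32-byte digest is the lookup key for every file.
        groups[sha256_of_file(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Note that this reads every byte of every file once, which is the cost the lookup key is being traded against.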
I don't think it would be far more efficient to hash the entire contents though. If you have a million files storing a terabyte of data, the first stage of the 2-stage comparison would read at most 1GB (1 million * 1KB) of data, and less for smaller files. If you hash the whole contents of every file, you have to read the entire 1TB. There are a hundred confounding variables, for sure. I don't think you could confidently estimate which would be more efficient without a lot of experimenting.
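To make the 2-stage variant concrete, here is a rough Python sketch. It is only one way to do it: the names (`read_prefix`, `full_hash`, `find_duplicates_two_stage`) are hypothetical, and the second stage confirms matches with a full SHA-256 rather than a byte-for-byte comparison, which is a substitution I made to keep the example short.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 1024  # size of the stage-one lookup key, as in the thread


def read_prefix(path, n=PREFIX_LEN):
    """Read at most the first n bytes of a file."""
    with open(path, "rb") as f:
        return f.read(n)


def full_hash(path, chunk_size=1 << 20):
    """SHA-256 of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def find_duplicates_two_stage(paths):
    """Stage 1: bucket by prefix. Stage 2: fully hash only shared buckets."""
    by_prefix = defaultdict(list)
    for path in paths:
        by_prefix[read_prefix(path)].append(path)

    duplicates = []
    for candidates in by_prefix.values():
        if len(candidates) < 2:
            continue  # unique prefix: the rest of this file is never read
        by_hash = defaultdict(list)
        for path in candidates:
            by_hash[full_hash(path)].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

How much this saves depends on how many files share a prefix. If most files have distinct first kilobytes, almost nothing beyond stage one is read. If many share a header, stage two approaches the cost of hashing everything.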