|
|
|
|
|
by redsaz
955 days ago
|
|
This isn't the first time I've heard concerns about using hashes for checking file equality. I've considered adding a "paranoid mode" to do the direct byte-for-byte checks for such folks that don't even want to introduce a so-remote-it's-virtually-impossible theoretical chance for a collision to occur. I'd go into the math about how remote of a chance it would be (barring any discovered hash collisions) but others have explained it better than me elsewhere. |
|
But as I tried to describe (probably in way too much detail) the real problem with "hash everything, compare hashes afterwards" is that it implies that you'll be doing I/O to read all of the file's contents even when it isn't needed to prove uniqueness. For a lot of common use cases (big files, few dupes) this can mean doing 1000x more I/O than you need.
Once you design the solution around avoiding unneeded I/O, you find that hashing also stops being useful.