Ask HN: md5 + size = file UID?


	1 points by alexrodygin 5798 days ago
	Is an MD5 checksum plus the file's size an absolutely unique identifier for a file across the universe?

5 comments

paulgb 5798 days ago

No, for the same reason you can't fit n+1 pigeons in n holes.

http://en.wikipedia.org/wiki/Pigeonhole_principle

mooism2 5798 days ago

No, it isn't.

MD5 gives 16 bytes of output, so consider all possible 17-byte files and their MD5 checksums. On average, each checksum will be shared by 256 different 17-byte files.
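To make the counting concrete, here is a rough Python sketch. The first part is just the arithmetic for 17-byte files. The second part is a scaled-down analogue I am assuming for illustration (every 2-byte input, keeping only the first byte of its MD5), since the real 17-byte space is far too large to enumerate. The names `bucket_counts` and `average` are made up for this example.

```python
import hashlib
from collections import Counter

# Arithmetic from the comment: 17-byte files versus 16-byte (128-bit) digests.
files_17_bytes = 256 ** 17   # 2**136 possible 17-byte files
md5_outputs = 2 ** 128       # possible MD5 values
avg_files_per_digest = files_17_bytes // md5_outputs  # 2**8 = 256


def bucket_counts():
    """Scaled-down analogue: hash every 2-byte input, keep 1 byte of the MD5."""
    counts = Counter()
    for i in range(256 ** 2):
        data = i.to_bytes(2, "big")
        counts[hashlib.md5(data).digest()[0]] += 1
    return counts


counts = bucket_counts()
# 65536 inputs spread over 256 possible one-byte values.
average = sum(counts.values()) / 256
```

Both `avg_files_per_digest` and `average` come out to 256, and by the pigeonhole principle at least one bucket in the small experiment must hold 256 or more inputs.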

If you're worried about MD5 collisions between files, adding the file size isn't going to do much to help. Better to use SHA1 or some other algorithm in addition to MD5. E.g. 16 bytes of MD5 + 20 bytes of SHA1 = 36 bytes total output.
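A minimal sketch of that combination in Python, using the standard `hashlib` module. The function name `composite_id` is hypothetical, and this is only one way to concatenate the two digests, not a standard format.

```python
import hashlib


def composite_id(data: bytes) -> bytes:
    """Concatenate the 16-byte MD5 digest and the 20-byte SHA1 digest."""
    return hashlib.md5(data).digest() + hashlib.sha1(data).digest()


file_id = composite_id(b"example file contents")
print(len(file_id))  # 16 + 20 bytes
```

The result is still a fixed-size value, so collisions remain possible in principle. Anyone would now need to find a collision in both functions for the same pair of files.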

mfukar 5798 days ago

No, because every hash function with a fixed-size output suffers from collisions. A perfect (collision-free) hash function for files would need at least log2(N) bits of output to distinguish N possible files, which for arbitrary files means an output as long as the files themselves.

In practice, you can go with SHA-2, for which no collisions have been found yet.

_0ffh 5798 days ago

No matter which hash function you use, you will _not_ get 100% unique identifiers, because you _can't_! Just do the freaking math, and you'll see for yourself - it's quite obvious, actually!

alexrodygin 5796 days ago

Yeah, I was confused. I'm a product guy, so math isn't my strongest point. It looks like combining MD5 + size + SHA1 should be enough to get an almost unique file ID. Thank you all for the replies.

cperciva 5798 days ago

Is md5 checksum plus the file's size is absolutely unique identifier for a file across the universe?

No. Use SHA256.
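For what it's worth, a minimal sketch of that in Python with the standard `hashlib` module. The helper name `sha256_file` and the 64 KiB chunk size are assumptions for illustration. Reading in chunks just avoids loading a large file into memory at once.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The 64-character hex string it returns can be stored as the file ID. It is not unique in the mathematical sense either, only very unlikely to collide in practice.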
