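MD5 gives 16 bytes (128 bits) of output, so consider all possible 17-byte files and their MD5 checksums. There are 2^136 such files but only 2^128 checksums, so on average each checksum will be shared by 2^136 / 2^128 = 256 of those 17-byte files, all of which have the same size.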
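If you want to see the counting argument in action, here is a scaled-down sketch. It assumes a toy checksum made by keeping only the first byte of MD5 (the helper name `truncated_md5` is made up for this example) and feeds it every possible 2-byte input, which gives the same one-extra-byte ratio as 17-byte files against a 16-byte checksum.

```python
import hashlib
from collections import Counter


def truncated_md5(data: bytes, n_bytes: int = 1) -> bytes:
    """Toy checksum: keep only the first n_bytes of the MD5 digest."""
    return hashlib.md5(data).digest()[:n_bytes]


# Every possible 2-byte input: 256 * 256 = 65536 of them, all the same size.
all_inputs = [bytes([a, b]) for a in range(256) for b in range(256)]

# Count how many inputs land on each 1-byte checksum value.
bucket_sizes = Counter(truncated_md5(x) for x in all_inputs)

possible_checksums = 256  # a 1-byte checksum has 2**8 possible values
average_per_checksum = len(all_inputs) / possible_checksums

print(len(all_inputs))                      # 65536
print(average_per_checksum)                 # 256.0
print(max(bucket_sizes.values()) >= 256)    # True, by the pigeonhole principle
```

Every input here has the same length, so tacking the size onto the checksum would not separate any of these collisions.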
If you're worried about MD5 collisions between files, adding the file size isn't going to do much to help. Better to use SHA1 or some other algorithm in addition to MD5. E.g. 16 bytes of MD5 + 20 bytes of SHA1 = 36 bytes total output.
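A minimal sketch of that MD5 + SHA1 combination, using Python's standard `hashlib`. The function name `combined_digest` is just an illustration, not an established API, and the input is assumed to fit in memory.

```python
import hashlib


def combined_digest(data: bytes) -> bytes:
    """Concatenate the 16-byte MD5 digest and the 20-byte SHA1 digest."""
    md5_part = hashlib.md5(data).digest()    # 16 bytes
    sha1_part = hashlib.sha1(data).digest()  # 20 bytes
    return md5_part + sha1_part


digest = combined_digest(b"hello world")
print(len(digest))  # 36
```

This only makes accidental collisions less likely; the output is still a fixed 36 bytes, so collisions remain possible in principle.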
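No, because every hash function with a fixed-size output suffers from collisions. A collision-free identifier for files would need at least log2(N) bits, where N is the number of distinct files it has to tell apart. For arbitrary files that means roughly as many bits as the files themselves.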
In practice, you can go with SHA-2, for which no collisions have been found yet.
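For what it's worth, here is a rough sketch of computing a SHA-256 (one of the SHA-2 family) identifier for a file with Python's standard `hashlib`. The function name `sha256_of_file` and the 64 KiB chunk size are arbitrary choices for this example.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `sha256_of_file("some_file.bin")` (a placeholder path) returns a 64-character hex string, i.e. 32 bytes of output.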
No matter which hash function you use, you will _not_ get 100% unique identifiers, because you _can't_!
Just do the freaking math, and you'll see for yourself - it's quite obvious, actually!
Yeah, I was confused. I'm a product guy, so math isn't my strongest point. It looks like combining MD5 + size + SHA1 should be enough to get an almost unique file ID. Thank you all for the replies.
http://en.wikipedia.org/wiki/Pigeonhole_principle