by StillBored 902 days ago
Sigh, been having this conversation in a related codebase. MD5 is just as fine as any other generic hash function if it's being used as a non-unique key. In many cases, replacing it with one of the more "secure" alternatives does nothing except that the resulting hashes are frequently longer, which further reduces the statistical chance of an accidental collision. For something like a document store, deduplication system, etc., simply taking the extra step of doing a binary comparison against the text associated with the hash ensures that accidental (or intentional) collisions are handled. With the bonus that, should the text comparison fail, you probably get to either publish a paper or detect someone trying to attack the system.
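
For illustration, a minimal sketch of the "hash as non-unique key plus binary comparison" idea. The `DedupStore` class and its method names are made up for this example, not taken from any real codebase, and it keeps everything in memory.

```python
import hashlib


class DedupStore:
    """Toy content store keyed by MD5, with a byte comparison on every hit."""

    def __init__(self):
        # digest -> list of distinct blobs that share that digest
        self._buckets = {}

    def put(self, data: bytes) -> tuple[str, int]:
        """Store data and return (digest, index within that digest's bucket)."""
        digest = hashlib.md5(data).hexdigest()
        bucket = self._buckets.setdefault(digest, [])
        for i, existing in enumerate(bucket):
            if existing == data:  # binary comparison, not just a hash match
                return digest, i
        if bucket:
            # Same hash, different bytes: an accidental or deliberate collision.
            print(f"collision on {digest}: keeping both blobs")
        bucket.append(data)
        return digest, len(bucket) - 1

    def get(self, key: tuple[str, int]) -> bytes:
        digest, index = key
        return self._buckets[digest][index]
```

Because the digest only selects a bucket, two different blobs with the same MD5 are both kept instead of one silently replacing the other.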

And given the history of cryptographic hashes, I'm even more convinced that anyone depending on SHA-3/whatever being better than MD5/etc. over the next 10-20 years is fooling themselves.

Now would I use it in a secure boot chain/etc as a stamp of uniqueness? Probably not.

3 comments

> MD5 is just as fine as any other generic hash function if it's being used as a non-unique key

MD5 brings the feature that you'll forever be explaining why you chose a function that had already been broken for 30 years when other options were readily available.

Often, it's less about picking it for a new project and more about how to update an existing one, often with data at rest that needs converting. In the latter case the hash often needs to fit into the existing 128-bit field size, so one is throwing a good number of the SHA bits away anyway.
This is not a justification for using a broken function.

Truncating a fixed number of bits does not make a good secure function less secure, other than the implications that the shorter length has on brute-force and collision strength.

In some cases it can even make it more secure, e.g. SHA-512/256, where the truncation protects against length-extension attacks.
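
A minimal sketch of fitting a SHA-2 digest into an existing 128-bit field, assuming the field simply holds 16 raw bytes. The function name `sha256_128` is hypothetical.

```python
import hashlib


def sha256_128(data: bytes) -> bytes:
    """Return the first 16 bytes (128 bits) of the SHA-256 digest of data."""
    return hashlib.sha256(data).digest()[:16]


if __name__ == "__main__":
    key = sha256_128(b"some natural key")
    print(len(key), key.hex())  # 16 bytes, same width as an MD5 digest
```

Note that this is plain truncation of SHA-256. The standardized SHA-512/256 is not just SHA-512 with bits cut off: it also uses different initial values, so its output differs from a truncated SHA-512 digest.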

Uh, truncation is a huge no-no for security. But that isn't what I was talking about. I meant deduplication/hash key/etc. kinds of uses where a checksum works (or where some other algorithm with a high chance of collisions is swapped in for unit testing to make sure the collision paths are handled correctly).

This is part of the entire discussion: it's possible to use algorithms originally intended for cryptographically secure purposes in places where the reversibility/collision properties aren't considered problems. And this will likely only happen more frequently as algorithms are picked because they have hardware acceleration and are fast rather than for their underlying security properties.

BTW: There is a fun game that I've played; see how much you can truncate a modern cryptographic hash before it becomes trivial to find collisions.
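
A rough sketch of that game, assuming SHA-256 and a naive dictionary-based birthday search. The function name and the choice of decimal counters as inputs are arbitrary choices for this example.

```python
import hashlib
from itertools import count


def find_truncated_collision(bits: int) -> tuple[bytes, bytes]:
    """Find two distinct inputs whose SHA-256 digests share the first `bits` bits."""
    seen = {}  # truncated digest -> first message that produced it
    for i in count():
        msg = str(i).encode()
        full = int.from_bytes(hashlib.sha256(msg).digest(), "big")
        prefix = full >> (256 - bits)  # keep only the top `bits` bits
        if prefix in seen:
            return seen[prefix], msg
        seen[prefix] = msg


if __name__ == "__main__":
    a, b = find_truncated_collision(32)
    print(a, b)
```

The expected work grows roughly as 2^(bits/2), so 32 bits falls after some tens of thousands of tries on a laptop, while 64 bits already needs billions of hashes and a smarter search than a Python dict.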

The chances of having an accidental hash collision are really small.

I have built data warehouses with MD5 as the hashing algorithm to generate keys from natural keys. Did some back-of-the-envelope calculations back then and found that the chance of a hash collision was minute. Don't remember the exact numbers, but it was somewhere in the 100s of years if I was generating keys every second.
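
That back-of-the-envelope calculation can be redone with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n keys and a b-bit hash. This is a sketch that assumes the hash behaves like a uniform random function on inputs nobody crafted; the function name and the one-key-per-second rate are only illustrative.

```python
import math


def collision_probability(n_keys: int, hash_bits: int = 128) -> float:
    """Birthday approximation of the chance of at least one accidental collision."""
    exponent = -n_keys * (n_keys - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values


SECONDS_PER_YEAR = 365 * 24 * 3600

if __name__ == "__main__":
    n = 100 * SECONDS_PER_YEAR  # one key per second for 100 years
    print(f"{n} keys: p = {collision_probability(n):.3g}")
```

For one key per second over 100 years (about 3.2 billion keys) this comes out on the order of 1e-20, and the probability only approaches even odds around 2^64 keys. None of this says anything about deliberately constructed collisions.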

This could btw very well be a thing with large volumes of data, but in many systems this is absolutely not a worry.

The chance of a random collision is minute, but if someone is actually building collisions the system is broken. DVC, for example, uses the MD5 of a file for deduplication, and when you purposely inject files with the same MD5 (which take seconds to build) the result is data loss.
MD5 is faster due to being older and made for older hardware, so I guess that is why it's in use for things like that. All deduplicating tools I have used first check the file length before they even try to do a checksum, so I guess that would take care of some problems. It's harder to find a collision if you have to keep the file size the same.
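
A sketch of the "check the length first, then checksum" approach that such tools take. This is not the code of any particular tool: `find_duplicates` is a made-up name, and SHA-256 is swapped in for MD5 here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(paths):
    """Return groups of paths whose file size and SHA-256 digest both match."""
    by_size = defaultdict(list)
    for p in map(Path, paths):
        by_size[p.stat().st_size].append(p)

    groups = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # unique length, so no need to read or hash the file
        for p in same_size:
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            groups[(size, digest)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

The size check mainly saves I/O by skipping files that cannot match anything. It should not be counted on as a defense, since published MD5 collision pairs are typically the same length.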
MD5 was faster on old CPUs.

All modern enough CPUs have dedicated SHA-256 and SHA-1 instructions and even SHA-256 is much faster than MD5.

So performance cannot be invoked as a reason to keep using MD5.

The cheapest Intel Atom CPUs added SHA-256 support in 2016 and AMD Zen 1 added support in early 2017. The Intel Core CPUs added support later, in 2019, due to the delayed 10-nm transition.

64-bit Arm has had support since the beginning, in 2012, which was what forced Intel to add SHA support to the Atom CPUs first, in order not to lose in the Geekbench benchmarks, where Atom CPUs were compared with Arm CPUs.
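
A quick way to check this on a given machine is a rough throughput comparison through Python's `hashlib`. This is only a sketch: the helper name and buffer size are arbitrary, and the result depends on whether the underlying library build actually uses the CPU's SHA instructions.

```python
import hashlib
import time


def throughput_mb_s(name: str, size_mb: int = 64) -> float:
    """Hash size_mb megabytes of zeros once and return the rate in MB/s."""
    data = bytes(size_mb * 1024 * 1024)
    start = time.perf_counter()
    hashlib.new(name, data).digest()
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256"):
        print(f"{name:>6}: {throughput_mb_s(name):8.1f} MB/s")
```

A single run is noisy, so repeat it a few times before drawing conclusions about which function is faster on your hardware.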

It's not harder to find a collision if you keep the file size the same, if you control both files at least.
Supposing I did want to use the file hash as a unique key and I really don't want to do a byte for byte comparison... And I care about speed but not so much about bad actors, what should I use?
Didn’t think about it much, but file size should be a good indicator if the hash isn’t horrible. md5 + file size comparison could work for your use-case.
One of the inputs for MD5 is the length of the message, so I'm at least wrong in the case of MD5. Don't know about the general case and although I'm interested in the answer I can't spend time on it right now. But if anyone has a pointer to a useful resource please reply.
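
For reference, the message length enters MD5 through its padding: a 0x80 byte, zero bytes up to 56 mod 64, then the bit length as a 64-bit little-endian integer. A sketch of just that padding step, with `md5_padding` as a made-up helper name:

```python
def md5_padding(message_len: int) -> bytes:
    """Return the padding MD5 appends to a message of message_len bytes."""
    pad_zeros = (55 - message_len) % 64       # zeros needed to reach 56 mod 64
    bit_len = (message_len * 8) % 2 ** 64     # length in bits, modulo 2^64
    return b"\x80" + bytes(pad_zeros) + bit_len.to_bytes(8, "little")


if __name__ == "__main__":
    print(md5_padding(3).hex())
```

Two messages of the same length get identical padding, so the length field by itself does nothing to separate equal-length inputs.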

In general, just use a good hash function.