Hacker News new | ask | show | jobs
by upofadown 902 days ago
The history of this makes it hard to convince people to supersede hashes based on the fact that they can be collided. If the legal community had switched to SHA-1 at the point that MD5 was found to be weak for collisions they would have had to consider switching over to SHA-2 10 years later. From their perspective they dodged a bullet.

There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle. If you chop down your 256 bit hash to 128 bits then due to birthday collisions you can probably brute force a collision anyway (you end up only having to do something like 2^64 operations). So there ends up being a usability argument for specifying that your system has to be able to be secure in the face of collisions. At that point you could then further argue that you will just stick with MD5.

3 comments

A truncated SHA-256 is both more secure and also faster to compute on any modern CPU than MD5, and a visual comparison would work identically.

If a visual comparison is believed necessary, it should better be made easier, e.g. by overwriting the two hash values, using text of different colors.

Otherwise, even a bash script, or even just one bash command line can easily compare the output of two sha256sum executions and print an appropriate message.

If you have some sort of information processing system available to compare hashes, then you would be better off comparing the data directly. The hashes when used in this context are primarily to make things like manual comparisons possible. Usability is the point.
> There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle.

Comparing MD5 and SHA-2 for visual human diffing is like comparing a stick of dynamite to a landmine when trying to pop a pimple; any potential safety differences are trivial once you start using something in a fundamentally unsafe way.

Running 2^64 SHA1 ops on a GPU takes 15 years, so I think finding a reasonable collision for that half using SHA2/3 is not as trivial as you suggest:

https://crypto.stackexchange.com/questions/84520/how-long-wo...

A chosen-prefix (i.e. a demonstration how to modify a given legal document to obtain two different documents that have the same hash) SHA-1 collision has already been computed:

"We have successfully run the computation during two months last summer, using 900 GPUs (Nvidia GTX 1060)."

Such resources can be easily rented from a cloud.

So for anyone willing to spend up to 100k USD, it is trivial to find SHA-1 collisions.

https://sha-mbles.github.io/

Since that post was published, the 4090 came out, which can (according to this hashcat benchmark [1]) do 50,638.7 million SHA1 hashes per second, so now it would only take a single 4090 GPU 11.55 years. Or you could buy 12 of them and do it in a year, etc. So it's definitely not cheap but 15 years is definitely an overestimate (and presumably GPUs will keep getting faster...).

SHA2-256 is "only" 21975.5 MH/s so you'd have to double the number of GPUs or amount of time.

[1] https://gist.github.com/Chick3nman/32e662a5bb63bc4f51b847bb4...

Generating 2^64 hashes isn't guaranteed to produce a collision, and even if a collision did exist in that set, you're not going to find it by getting a bunch of GPUs to compute 2^64 hashes. There's a huge difference between a haystack that maybe contains a needle, and a needle that's been pulled from the haystack and presented to you. To actually find and identify the collisions you'll need to hook those GPUs up to some sort of storage/retrieval system. Just to store 2^64 128-bit hashes would take 295.1 exabytes. That's an order of magnitude more storage than NSA's utah datacenter[1].

[1] https://en.wikipedia.org/wiki/Utah_Data_Center

You can generate pairs of hashes for random inputs and check for collision without storing all of the outputs, no?
True, which is why 256-bit or bigger hashes are recommended to make sure that the time for finding a collision is alone great enough to make this impossible.

Finding a collision for an 160-bit hash by brute force, on a big cloud or supercomputer, is about the maximum that someone with unlimited financial resources could do.

That requires way more hashes to be computed though.
I don't think that's correct? It's probabilistic, yes, but in expectation you would still need ~2^64 hashes to find a collision for a 128-bit hash (birthday paradox).