Hacker News
by jochem9 902 days ago
The chances of having an accidental hash collision are really small.

I have built data warehouses with MD5 as the hashing algorithm to generate keys from natural keys. I did some back-of-the-envelope calculations back then and found that the chance of a hash collision was minute. I don't remember the exact numbers, but it was somewhere in the 100s of years if I was generating keys every second.
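For anyone who wants to redo that back-of-the-envelope calculation, here is a rough sketch using the standard birthday-bound approximation. It assumes hash outputs behave like uniformly random 128-bit values (true for accidental collisions, not for attacks), and the function names and the one-key-per-second rate are only illustrative.

```python
import math

def collision_probability(n_keys: int, hash_bits: int = 128) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes every key hashes to an independent, uniformly random value.
    """
    exponent = -n_keys * (n_keys - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

def keys_for_probability(p: float, hash_bits: int = 128) -> float:
    """Approximate number of keys needed to reach collision probability p."""
    return math.sqrt(2 * 2 ** hash_bits * math.log(1 / (1 - p)))

SECONDS_PER_YEAR = 365 * 24 * 3600

if __name__ == "__main__":
    # Hypothetical workload: one new key per second for 100 years.
    n = 100 * SECONDS_PER_YEAR
    print(f"keys generated: {n}")
    print(f"collision probability: {collision_probability(n):.3e}")
    years = keys_for_probability(0.5) / SECONDS_PER_YEAR
    print(f"years at 1 key/s for a 50% chance: {years:.3e}")
```

Under these assumptions, a century of one key per second gives a collision probability on the order of 1e-20, so accidental collisions really are not the practical concern for a 128-bit hash.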

This could, by the way, very well be a concern with large volumes of data, but in many systems this is absolutely not a worry.

1 comments

The chance of a random collision is minute, but if someone is deliberately constructing collisions, the system is broken. DVC, for example, uses the MD5 of a file for deduplication, and when you purposely inject files with the same MD5 (which takes seconds to build), the result is data loss.
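To make the data-loss mechanism concrete, here is a toy sketch of a content-addressed store. It is not DVC's code: `ContentStore`, `weak_key`, and `find_colliding_pair` are hypothetical names, and a digest truncated to 16 bits stands in for a crafted MD5 collision so that a colliding pair can be found by brute force in well under a second.

```python
import hashlib

def weak_key(data: bytes) -> str:
    # Stand-in for a broken hash: only the first 16 bits of SHA-256.
    # A real attack would use two crafted inputs with the same MD5.
    return hashlib.sha256(data).hexdigest()[:4]

class ContentStore:
    """Toy deduplicating store: one blob kept per digest."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = self.key_fn(data)
        # Dedup step: an existing key is assumed to mean identical content,
        # so the new bytes are never written.
        self.blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]

def find_colliding_pair(key_fn):
    """Brute-force two different inputs that share a key under key_fn."""
    seen = {}
    i = 0
    while True:
        data = f"file-{i}".encode()
        key = key_fn(data)
        if key in seen:
            return seen[key], data
        seen[key] = data
        i += 1

if __name__ == "__main__":
    first, second = find_colliding_pair(weak_key)
    store = ContentStore(weak_key)
    store.put(first)
    key = store.put(second)
    # The store hands back the first file's bytes; the second is gone.
    print(first, second, store.get(key))
```

The point is only that a store which trusts the digest as identity silently drops the second file; whether a given tool behaves exactly like this depends on its implementation.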
MD5 is faster due to being older and made for older hardware, so I guess that is why it's in use for things like that. All deduplicating tools I have used check the file length before they even try to compute a checksum, so I guess that would take care of some problems. It's harder to find a collision if you have to keep the file size the same.
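The "check the length first" approach can be sketched roughly as follows. This is an assumption about how such tools are typically structured, not the code of any particular one; `find_duplicates` and `sha256_of_file` are hypothetical helpers, and SHA-256 is used here instead of MD5.

```python
import hashlib
import os
from collections import defaultdict

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths):
    """Return lists of paths whose size and SHA-256 digest both match."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate; skip hashing
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[sha256_of_file(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

The size comparison here is a cheap pre-filter that avoids reading files that cannot match; it is a speed optimization rather than a defense against crafted collisions.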
MD5 was faster on old CPUs.

All reasonably modern CPUs have dedicated SHA-1 and SHA-256 instructions, and with them even SHA-256 is much faster than MD5.

So performance cannot be invoked for continuing to use MD5.
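A quick way to check this on your own machine is a rough throughput measurement like the sketch below. `throughput_mb_per_s` is a hypothetical helper; the result depends on the CPU and on whether the crypto library behind Python's `hashlib` actually uses the SHA instructions, so treat the numbers as indicative only.

```python
import hashlib
import time

def throughput_mb_per_s(algorithm: str, size_mb: int = 64, repeats: int = 3) -> float:
    """Best-of-N hashing throughput for a zero-filled buffer, in MB/s."""
    data = bytes(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(algorithm, data).digest()
        best = min(best, time.perf_counter() - start)
    return size_mb / best
```

Calling it for "md5", "sha1", and "sha256" and comparing the three values shows which one wins on a given system.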

The cheapest Intel Atom CPUs added SHA-256 support in 2016, and AMD Zen 1 added support in early 2017. The Intel Core CPUs added support later, in 2019, due to the delayed 10-nm transition.

64-bit Arm has had support since the beginning, in 2012, which is what forced Intel to add SHA support to the Atom CPUs first, in order not to lose in the Geekbench benchmarks, where Atom CPUs were compared with Arm CPUs.
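If you want to see whether a particular Linux machine advertises these instructions, something like the following sketch works. It assumes the flag names the Linux kernel reports in /proc/cpuinfo (`sha_ni` on x86, `sha2` on 64-bit Arm); `has_sha_extensions` and `read_cpuinfo` are hypothetical helpers, and other operating systems need a different mechanism.

```python
def has_sha_extensions(cpuinfo_text: str) -> bool:
    """Return True if /proc/cpuinfo-style text lists a SHA-256 CPU flag."""
    sha_flags = {"sha_ni", "sha2"}  # x86 and arm64 names, as reported by Linux
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 lists flags under "flags", arm64 under "Features".
        if key.strip().lower() in ("flags", "features"):
            if sha_flags & set(value.split()):
                return True
    return False

def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read cpuinfo, returning an empty string where it is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

if __name__ == "__main__":
    print("SHA extensions reported:", has_sha_extensions(read_cpuinfo()))
```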

It's not harder to find a collision if you keep the file size the same, if you control both files at least.