Poisonous MD5 – Wolves Among the Sheep | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Poisonous MD5 – Wolves Among the Sheep (blog.silentsignal.eu)
	110 points by 4mnt 4023 days ago

5 comments

WalterGR 4022 days ago

The relevance of the article's mention of the "Flame" malware was puzzling, since no context is provided and the linked Wired article doesn't shed any light.

Wikipedia has this to say, which seems to solve that puzzle:

"Flame was signed with a fraudulent certificate purportedly from the Microsoft Enforced Licensing Intermediate PCA certificate authority. The malware authors identified a Microsoft Terminal Server Licensing Service certificate that inadvertently was enabled for code signing and that still used the weak MD5 hashing algorithm, then produced a counterfeit copy of the certificate that they used to sign some components of the malware to make them appear to have originated from Microsoft. A successful collision attack against a certificate was previously demonstrated in 2008, but Flame implemented a new variation of the chosen-prefix collision attack."

http://en.m.wikipedia.org/wiki/Flame_%28malware%29

WalterGR 4022 days ago

Whoops - I didn't remember that Wikipedia uses a separate domain for mobile browsers.

Here's the 'real' link: http://en.wikipedia.org/wiki/Flame_%28malware%29

aylons 4022 days ago

I'm no security expert, but I have a question.

In some systems I've built in the past I employ MD5 as a hashing mechanism to verify firmware integrity after flashing it in the memory. I don't use MD5 for anything security related (this is treated in other ways, depending on the system), just to check transmission and memory integrity.

Is MD5 still considered fine for this, or is there a real risk that random or systematic (but unintentional) noise could generate a collision between corrupted and original data? I do believe it should suffice, but hearing all the badmouth makes me wonder...

andrew-lucker 4022 days ago

MD5 is good enough to prevent most random collisions. The problem is when you need to prevent intentional collisions.

innocenat 4022 days ago

As a note, even CRC32 is enough to check most random collisions.

aylons 4022 days ago

I actually thought of using CRC32 when developing my first system, but I preferred MD5 because it would be easier and clearer for people to check file integrity in a *NIX command line.

cmdrfred 4022 days ago

I'm no expert either but as I recall if you verify the length as well it should be almost impossible.

dperfect 4023 days ago

Aren't all hashing algorithms vulnerable to the possibility for collisions (albeit with different degrees of difficulty)? It sounds like the problem here is more related to the logic that relies on a hash alone to make important decisions.

Not saying that MD5 is a good choice in this case, just that we may be blaming the wrong thing.

mikeash 4023 days ago

Your parenthetical is the key, though. There's a big difference between a hash algorithm where generating a collision requires a few minutes of work on a cheap computer (MD5, now) and a hash algorithm where generating a collision requires a computer the size of the universe operating for a trillion trillion years (any good cryptographically secure hash).

dperfect 4023 days ago

Cool - didn't realize the difference was so great. I've always known that the good algorithms are better because they're more difficult to brute-force, but always wondered if it's just a matter of a few years before the "impossible" becomes possible. Your illustration helps clarify that improbability in my mind - thanks!

mikeash 4023 days ago

Glad to be of service. This seems to be a common misconception.

To put some numbers on it, a good cryptographic hash should produce random-looking output with no way to predictably influence the output, and no visible correlation between the input and the output. Put in A, get out "random number" B, except that every time you put in the same A, you get out the same B.

If you have a hash that matches that description, then your only hope is brute force, so you look at the size of the hash. If you're breaking SHA-256, for example, that's 256 "random" bits.

If you have a specific hash that you want to generate, it will take you on average 2^255 attempts. (On average you have to try half the possibilities before you find a match.) If your computer can do a billion hashes per nanosecond then that means you'll spend on average 2^255 / (10^9 * 10^9) seconds, or about 10^41 times the current age of the universe.

If you just want to find any collision, then the birthday paradox comes into play, and on average you'll need to try roughly the square root of the number of possibilities before you find a collision. At a billion per nanosecond that's 2^128 / (10^9 * 10^9) second, or a mere 780 times the current age of the universe. Plus you have to figure out how to store all those intermediate results.

If MD5 fit this property it would still be good. Not as good as SHA-256, because it's only 128 bits, but plenty sufficient. 128 bits with the hypothetical billion hashes per nanosecond gives you 390 times the current age of the universe to find a specific hash. A collision is easier, at 18 seconds, but a billion hashes per nanosecond is also orders of magnitude faster than you'll realistically be able to do, and you'll need 2^64 * 128 bits = 300 exabytes of storage.

The problem with MD5 is that it does not fit this "random number" property. You can manipulate the input with somewhat predictable results in the output if you're clever, and that means you can generate a collision much easier than it would require for brute force.

My understanding is that more modern hashes are believed/hoped to have this property, but it's unknown. And not only that, but it's unknown whether any hash could exist with that property, or whether it's a theoretically impossible goal.

granos 4023 days ago

Also, even it were provably true that some hash algorithm has this property, we would have to implement it properly. That requires the algorithm itself as well as all of its dependencies to be implemented properly. It's very easy to make an implementation mistake that breaks security.

mikeash 4023 days ago

Is that typically a problem for a plain hash? There's no random data to be leaked or generated poorly, and there's typically not going to be any private information that could be leaked through a timing attack.

__z 4023 days ago

>always wondered if it's just a matter of a few years before the "impossible" becomes possible.

This is from 1998 but the relevant parts - https://www.schneier.com/essays/archives/1998/05/the_crypto_...

>Cryptographic algorithms have a way of degrading over time. It's a situation that most techies aren't used to: Compression algorithms don't compress less as the years go by, and sorting algorithms don't sort slower. But encryption algorithms get easier to break; something that sufficed three years ago might not today.

>Cryptographic algorithms are all vulnerable to brute force--trying every possible encryption key, systematically searching for hash-function collisions, factoring the large composite number, and so forth--and brute force gets easier with time. A 56-bit key was long enough in the mid-1970s; today that can be pitifully small. In 1977, Martin Gardner wrote that 129-digit numbers would never be factored; in 1994, one was.

>Aside from brute force, cryptographic algorithms can be attacked with more subtle (and more powerful) techniques. In the early 1990s, the academic community discovered differential and linear cryptanalysis, and many symmetric encryption algorithms were broken. Similarly, the factoring community discovered the number-field sieve, which affected the security of public-key cryptosystems.

DES was used in the 70s, now it can be brute forced in a few days (with the right hardware).

dperfect 4023 days ago

I suppose one could say this is an argument against using the "computer the size of the universe operating for a trillion trillion years"-type illustrations. Statements like that reflect the current theoretical strength of an algorithm, but unfortunately the illustrations can lead us to (wrongly) assume that flaws won't be discovered in the algorithm for that period of time, which is very much untrue and undermines the practical implications of those statements.

__z 4023 days ago

No arguments from me that cyptography is often explained confusingly which leads to misunderstandings.

syncsynchalt 4023 days ago

To reinforce your point, DES is now crackable by the public in under a day: https://www.cloudcracker.com/blog/2012/07/29/cracking-ms-cha...

rictic 4023 days ago

MD5 is also supposed to require a tremendous amount of computation, it's just that there are weaknesses in the algorithm that were eventually discovered, such that an attacker can short circuit much of that work. The reason that SHA-1 is being sunset is that some preliminary results by researchers suggest that it also has weaknesses that can enable it to be short circuited, and that researchers appear to be on a path to discovering those weaknesses.

dperfect 4023 days ago

Hmmm... so the takeaway for me is that when designing critical systems that rely on hashes/fingerprints as identifiers, we should probably treat those as transient identifiers with a reasonable expectation of migrating them as newer algorithms replace older broken ones. Does that sound right?

Diederich 4023 days ago

That's correct.

userbinator 4022 days ago

The story of how the MD5 collision pathways were discovered is an interesting one - the first colliding pair is said to have been calculated by Xiaoyun Wang by hand, with little computer use. She has also figured out similar methods for SHA-1 and a few other hashes.

http://en.wikipedia.org/wiki/Wang_Xiaoyun

gkoz 4023 days ago

Did breaking MD5 require a computer the size of the universe 20 years ago?

mikeash 4023 days ago

It was thought to, which is why it became so popular. It turned out to be wrong.

astrodust 4023 days ago

Not really. It was cracked by a distributed network of ordinary computers were people volunteered CPU time.

pbsd 4023 days ago

MD5's compression function was already broken 20 years ago: http://cseweb.ucsd.edu/~bsy/dobbertin.ps

malka 4023 days ago

Yeah, but MD5 is broken in the sense that you can generate a collision on purpose.

kedean 4023 days ago

The collision-resistance property that all good hashes should have (and md5 lacks) states that an attacker with an input and its hash cannot arbitrarily produce a second input with the same hash. The possibility of it happening in the wild will always exist with hashes by their finite nature, but the only way an attacker should be able to find collisions is by enumerating the input space (rainbow table generation).

jimrandomh 4023 days ago

No, the property you describe is called "preimage resistance". Collision resistance is stronger; it states that an attacker should not be able to create a pair of inputs with the same hash. In the case of md5, creating a pair of inputs with the same hash is easier than creating another input with the same hash as something else which you didn't yourself generate.

The MD5 algorithm is known to lack collision resistance, but whether it has preimage resistance is less certain; mathematical advances have weakened its preimage resistance, but not yet to the point of demonstrating a practical preimage attack.

garrettr_ 4023 days ago

Nitpick: the property OP describes is actually second pre-image resistance, not preimage resistance (or collision resistance). See https://en.wikipedia.org/wiki/Cryptographic_hash_function#Pr...

kedean 4023 days ago

My mistake, I always mixed those two up. Both properties address OP, though, as MD5 is not suspected to have preimage resistance either (it's just not to the point of somebody having done it yet).

4mnt 4023 days ago

> In the case of md5, creating a pair of inputs with the same hash is easier than creating another input with the same hash as something else which you didn't yourself generate.

This is the case with all instances of seeking a collision, due to the birthday paradox [0]

0: https://en.wikipedia.org/wiki/Birthday_attack

cstejerean 4023 days ago

The birthday paradox helps with the case of finding any two random inputs that have the same hash. The problem with MD5 is that it's feasible to craft two specific inputs that happen to have the same hash.

hinkley 4023 days ago

Yes, but the complexity of finding a collision in SHA1 is about 2^14 higher than MD5, and even SHA1 is being sunset by many people.

cm2187 4023 days ago

I don't get why it is a security problem that someone can manufacture false positives for an anti-virus. What is the benefit for a virus to have non-malicious code caught by the anti-virus?

False negatives would be more of an issue if the anti-virus has white lists and one can manufacture a Microsoft Excel MD5 signature with a malware. But that's not what the article refers to.

MD5 is only broken if you want to use it as a non-reversible hashing algorithm or if you want to use it as a an unforgeable signature. But it's perfectly fine for many other usage.

bariumbitmap 4023 days ago

From the article:

  As you can see, binaries submitted for analysis are
  identified by their MD5 sums and no sandboxed execution is
  recorded if there is a duplicate (thus the shorter time
  delay). This means that if I can create two files with the
  same MD5 sum – one that behaves in a malicious way while the
  other doesn’t – I can “poison” the database of the product
  so that it won’t even try to analyze the malicious sample!

So it's a technique to get the scanner to ignore a malicious binary by constructing a non-malicious one with the same MD5 sum. This would be much harder if the scanner used a SHA-1 hash or similar.

cm2187 4023 days ago

But that's a white list. But I thought anti-virus rather work by black listing.

Too 4023 days ago

virustotal.com allows you to upload files to scan with a whole range of anti-virus programs. Before uploading, it will calculate the hash of your file client-side to see if the file should be uploaded or if a previously uploaded (by someone else) file with same hash should be re-scanned with newer versions of the anti-virus.

I don't know which hashing algorithm they use but just as example of a situation where whitelist is not used.

bariumbitmap 4022 days ago

Yes, I think that's what the author was alluding to here, although I'm not sure:

  The approach may work with traditional AV software too as
  many of these also use fingerprinting (not necessarily MD5)
  to avoid wasting resources on scanning the same files over
  and over (although the RC4 encryption results in VT 0/57
  anyway…).

sarciszewski 4023 days ago

sha256sum or b2sum (BLAKE2b) would be far better than sha1 :)

jimrandomh 4023 days ago

You misunderstand. The researchers are presenting a way to manufacture false negatives for an anti-virus. It works by confusing antivirus vendors' infrastructure into thinking it's already analyzed an executable and found it to be innocent when it's really analyzed something else.

snorkel 4023 days ago

The attack vector would be malware binary crafted to have the same MD5 sig as a popular already trusted app. But of course once the badware is caught virus scanners could check other properties aside from MD5 sig to flag a bad binary. I assume virus scanners use MD5 just a fast prescreen scan, then do a few deeper checks on pototentially bad binaries to make sure.

ryan-c 4023 days ago

What you describe there would be a preimage attack[0], not a collision attack. There is no publicly known practical[1] preimage attack on MD5 at this time.

0. http://en.wikipedia.org/wiki/Preimage_attack

1. 2^123.4 complexity is not practical

taumhn 4023 days ago

If I understand correctly, what's described in the article indeed uses a collision attack. The "trusted app" is one the hacker manufactures himself.

ryan-c 4022 days ago

Yes, that was what the article says, but the person I was responding to misunderstood.

makomk 4022 days ago

This is actually one of the older and easier attacks against MD5; we've known this was possible for over a decade. Nowadays it's actually possible to so chosen prefix attacks - you can literally take two arbitrary, unrelated files and append some data that makes them have the same MD5. So you don't even have to include the malicious code in the decoy file in any form anymore.