Hacker News new | ask | show | jobs
by klodolph 3417 days ago
The concepts are a bit sloppy here. "Zero information" is a bit of hand waving which relies on the idea of having a uniform distribution on a set with infinite measure, which is not mathematically sound nor is it useful as an approximation.

All messages that you choose to transmit follow some probability distribution with each message having finite, nonzero probability. This is what the author gets wrong, on a very fundamental level. If this were not true, your messages would require an infinite number of bits to transmit.

Studies of the usefulness of hash functions are based on comparisons to an "ideal" hash function called a random oracle. The idea behind a random oracle is that you can't reverse it, but you can try to remember what outputs it gives for known inputs.

This "perfect hash function" has been extensively studied and still suffers from all of the flaws that the author mentions in this article. And yet people prove rather strong properties of cryptography systems under the random oracle assumption. So the author's criticisms, fortunately, have no bearing or relationship to real cryptography, because the good systems are already designed to be immune to such attacks.

I'm on mobile so I can't really give a play by play but the article is really based on nothing more than bad math.

1 comments

I think the article mixes up two things, namely the hash function business and uninformative priors.

Hash functions. In case of a good hash function the information about the actual distribution of the inputs is erased. That the hashes are uniformly distributed does not imply that the inputs are uniformly distributed which in turn renders any conclusion about the distribution of the inputs after applying a transformation invalid.

Uninformative priors. This seems to be more of a philosophical problem I don't really know much about. But as far as I can tell if one tries to quantify a lack of information in a naive way using probability distributions, then one gains information, for example that the value is uniformly distributed over some range, and this information has consequences like specific distributions of derived values.

So the attempt to quantify a lack of information turns into a self-defeating endeavor, not necessarily because of any inherent information but because of the information injected during the modeling process.

That is not a very precise way of describing the uniformity of hash functions, and it's not actually true. For example, if my input has some element with 50% probability, then the hashed output is going to have some element with at least 50% probability.

But this is well-known, and because it's well-known, no sane cryptographic system cares that hash outputs leak information that way. For example, look at PBKDF2, HMAC, or various asymmetric key authentication schemes.

You are right, I did not correctly word what I wanted to say. I was not thinking about repeated messages, I was only thinking about subsets of all possible messages. Say you are hashing HTML documents, then all your inputs will be in the subspace having <!DOCTYPE html><html as first characters but that information will be erased by the hash function.