by mellavora 1520 days ago
The typical measure of entropy (Shannon or Gibbs; let's save the details for later, after you've read up on the theory of large deviations) is

- sum (p log(p))

which is not that different from the formula for the mean

sum (p 1/n)

The critical difference is that the normalization constant is based on the probability of the state rather than assuming a uniform probability over all states.

So, in effect, the entropy is a measure of the mean. It is a measure adapted to the case where "mean" is ill-defined because the number of modes and/or the variation around those modes is not handled well by simpler metrics.

3 comments

If there was anyone who taught you this then they should be fired.

More constructively, principal among the many things wrong with your comment is the formula for the mean; sum_i p_i = 1, so sum_i p_i / n = 1 / n. The mean would instead be sum_i p_i x_i.
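To make the difference concrete, here is a minimal Python sketch. The values xs and probabilities ps are made up for illustration and are not taken from the parent comment.

```python
# Hypothetical distribution: values xs with probabilities ps (illustrative only).
xs = [1, 2, 3, 4]
ps = [0.1, 0.2, 0.3, 0.4]
n = len(ps)

# "Mean" as written in the parent comment: sum_i p_i / n.
# Because the probabilities sum to 1, this is always 1/n.
parent_mean = sum(p / n for p in ps)

# The usual mean (expected value): sum_i p_i * x_i.
expected_value = sum(p * x for p, x in zip(ps, xs))

print(parent_mean, expected_value)
```

The first number depends only on how many outcomes are listed; the second depends on the values themselves.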

Perhaps I'm misunderstanding or missing something, but I'm afraid this seems completely wrongheaded to me. (My apologies for being so blunt, but right now your comment appears to be the most-upvoted, and I therefore think it needs some pushback.)

[EDITED to add: I was looking at an old version of the page; by the time I wrote this the parent was no longer the top comment. I'll leave the bluntness in, especially as at least one other person was even blunter.]

You refer to "the mean" and I think you mean the mean of the probabilities. Now, when you've got a probability distribution, by far the usual thing for "the mean" to mean is the sum of Pr(x) x -- the mean of the values. Taking the mean of the probabilities is a really strange thing to do.

One reason why it's a really strange thing to do is that this thing you call n is really kinda meaningless. There's no difference between these two probability distributions: (a) 1, 2, 3, or 4, with probabilities 0.1, 0.2, 0.3, 0.4 respectively; (b) 1, 2, 3, 4, or 5, with probabilities 0.1, 0.2, 0.3, 0.4, 0 respectively. But (a) has n=4 and (b) has n=5. Maybe you want n to be the number of nonzero probabilities? But now consider (a) along with the following probability distribution parameterized by a (small, positive) number h: 1, 2, 3, 4, or 4+h, with probabilities 0.1, 0.2, 0.3, 0.4-h, h. Every version of this distribution with h>0 has n=5, but when h is very small it's practically indistinguishable from (a) with n=4.
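A rough sketch of this point in Python, using the distributions (a) and (b) above and the h-parameterized family. The helper names entropy and perturbed are mine, and natural logs are an arbitrary choice.

```python
import math

def entropy(ps):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in ps if p > 0)

a = [0.1, 0.2, 0.3, 0.4]       # distribution (a), n = 4
b = [0.1, 0.2, 0.3, 0.4, 0.0]  # distribution (b), n = 5

def perturbed(h):
    """The family with probabilities 0.1, 0.2, 0.3, 0.4 - h, h (n = 5 for h > 0)."""
    return [0.1, 0.2, 0.3, 0.4 - h, h]

# (a) and (b) have different n but the same entropy.
print(1 / len(a), 1 / len(b), entropy(a), entropy(b))

# As h shrinks, 1/n stays stuck at 1/5 while the entropy approaches that of (a).
for h in (1e-2, 1e-4, 1e-6):
    ps = perturbed(h)
    print(h, 1 / len(ps), entropy(ps))
```

So 1/n jumps discontinuously between indistinguishable distributions, while the entropy varies smoothly.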

Further, since the sum of probabilities is always 1, what you write as sum (p 1/n) is just the same as the number 1/n. You can call it "the mean" if you want to, but I don't see what this adds over calling it what it is: the reciprocal of the number of possibilities.

There is something to what you say: the entropy is kinda related to the number of possibilities; if the probabilities are all equal, the entropy is log(#possibilities); if the probabilities are equal-ish then it's modestly smaller than that. But note e.g. that this relationship is exactly the inverse of what you say, in that "the mean" decreases with the number of possibilities, and the entropy increases with the number of possibilities.
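A quick way to check both claims, as a sketch: the skewed distribution below is an arbitrary "equal-ish" example of my own choosing.

```python
import math

def entropy(ps):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# Uniform case: entropy equals log(n) and grows with n, while 1/n shrinks.
for n in (2, 4, 8, 16):
    uniform = [1 / n] * n
    print(n, 1 / n, entropy(uniform), math.log(n))

# Equal-ish case (hypothetical numbers): entropy is modestly below log(4).
skewed = [0.3, 0.3, 0.2, 0.2]
print(entropy(skewed), math.log(4))
```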

The entropy is not "a measure of the mean". It kinda-sorta is related to "the number of possibilities", which is the reciprocal of "the mean". It is not at all the case, as your last paragraph suggests, that for most purposes we should be using "the mean" but we need to use the entropy when "the number of modes ... is not handled well by simpler metrics", whatever that means; for most purposes we should be using the entropy, and in the special case where all the probabilities are equal we can get away with just counting possibilities.

(In some important situations it turns out that what you have is some number of possibilities with roughly equal probabilities, and a whole lot more whose probabilities rapidly decrease to almost zero, and then you can get away with counting the number of reasonably-probable possibilities and taking its log. E.g., various situations in communications theory can fruitfully be thought of this way. But the entropy is still the more fundamental quantity, and "the mean" is still a needless obfuscation of "the (effective) number of possibilities".)
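One way to illustrate the "effective number of possibilities" idea is to exponentiate the entropy. The distribution below is invented for the purpose: 4 roughly equal outcomes plus 1000 very unlikely ones, with eps an assumed small total mass on the tail.

```python
import math

def entropy(ps):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in ps if p > 0)

eps = 1e-3  # assumed total probability of the long tail
likely = [(1 - eps) / 4] * 4   # 4 roughly equal, reasonably probable outcomes
tail = [eps / 1000] * 1000     # 1000 outcomes with almost zero probability
ps = likely + tail

n = len(ps)                       # literal count of possibilities: 1004
effective = math.exp(entropy(ps)) # "effective" count implied by the entropy

print(n, effective)
```

Here the literal count is 1004, while exp(entropy) comes out only slightly above 4, which is the count of reasonably-probable outcomes.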

It can be related to compression. If some phrase has a probability p_i of occurring, then the optimal length for its code is -log(p_i). The entropy sum(-p_i log(p_i)) = mean(-log(p_i)) is the length of code you will use on average.
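A small sketch of this, assuming base-2 logs (so lengths are in bits) and a made-up set of phrases whose probabilities are powers of 1/2, so that -log2(p_i) is a whole number of bits. The phrases and the code table are hypothetical.

```python
import math

# Hypothetical phrases with dyadic probabilities (illustrative only).
probs = {"the": 0.5, "of": 0.25, "entropy": 0.125, "mean": 0.125}

# A prefix-free binary code whose codeword lengths equal -log2(p).
code = {"the": "0", "of": "10", "entropy": "110", "mean": "111"}

# Optimal length for each phrase: -log2(p_i).
ideal_lengths = {w: -math.log2(p) for w, p in probs.items()}

# Entropy in bits: the probability-weighted mean of -log2(p_i).
entropy_bits = sum(p * ideal_lengths[w] for w, p in probs.items())

# Average length actually used by the code above.
average_code_length = sum(p * len(code[w]) for w, p in probs.items())

print(ideal_lengths)
print(entropy_bits, average_code_length)
```

When the probabilities are not powers of 1/2, -log2(p_i) is not an integer, so a code with whole-bit codewords can only approach the entropy rather than match it exactly.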