Hacker News
by palmy 2800 days ago
Thank you for the link to Jaynes' book! Really nice to see the different approaches.

I'm intrigued by your comment on maximum entropy. I personally struggle with the maximum entropy derivation because it uses differential ("continuous") entropy to derive the Gaussian under constraints on the first and second moments. Differential entropy does not satisfy the same properties as entropy for a discrete distribution, some of which are the very properties that motivated entropy as a measure of information. Jaynes himself wrote a paper on this topic of continuous entropy in the 60s (can dig out the reference in the morning). Even ignoring this, I also struggle a bit with "we're only constraining the first and second moments". Why exactly the first two? Why not the first three, etc.? One could say it's motivated by the fact that the Gaussian is the only distribution with a finite number of nonzero cumulants, but that seems a bit hand-wavy?

Would genuinely appreciate some input here, as the Principle of Maximum Entropy is something I have a bit of trouble coming to terms with for the reasons described above (and in general, mainly because the choice of constraints is arbitrary).
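
For concreteness, both concerns can be put into numbers with textbook closed-form entropies (in nats). This is only a sketch: the function names are illustrative and do not come from Jaynes or from any library. It shows that differential entropy can be negative and shifts by ln(a) under a rescaling x -> a*x, and that among three unit-variance densities the Gaussian has the largest differential entropy.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def laplace_entropy(variance):
    """Laplace density with scale b: variance = 2*b^2, entropy = 1 + ln(2b)."""
    b = math.sqrt(variance / 2)
    return 1 + math.log(2 * b)

def uniform_entropy(variance):
    """Uniform density of width w: variance = w^2 / 12, entropy = ln(w)."""
    w = math.sqrt(12 * variance)
    return math.log(w)

# Unlike discrete entropy, differential entropy can be negative...
narrow = gaussian_entropy(0.1)
# ...and it is not invariant under a change of units: x -> 2x adds ln(2).
rescale_shift = gaussian_entropy(2.0) - gaussian_entropy(1.0)

# With the variance fixed at 1, the Gaussian has the largest entropy of the three.
unit_variance = {
    "gaussian": gaussian_entropy(1.0),
    "laplace": laplace_entropy(1.0),
    "uniform": uniform_entropy(1.0),
}

print(narrow, rescale_shift, unit_variance)
```

Comparing three families is of course not a proof of the maxent property, just a sanity check of it.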

3 comments

There's kind of two issues at least. One is the continuous-discrete issue and the other is the moment issue.

As for the moment issue, the short story is that as you get into three or four moments, there isn't a general maximum entropy distribution anymore, except for some special idiosyncratic cases in the case of three I think. So the normal is, in some ways, the most conservative distribution you can have in a general, unspecified scenario sense. You can specify more moments, but then there isn't a single maxent distribution you can specify that would apply across all third and fourth-moment scenarios in the same way that would apply for the first two moments.
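
One way to see the moment problem, as a rough sketch: when a maxent density under constraints on the first n moments exists, it has the form p(x) proportional to exp(l1*x + l2*x^2 + ... + ln*x^n), where the l's are Lagrange multipliers. On the whole real line this can only be normalized if the exponent goes to minus infinity in both tails. The function name and the sample multipliers below are made up for illustration.

```python
def tails_decay(lambdas):
    """Return True if exp(sum_k lambdas[k-1] * x**k) decays in both tails.

    lambdas[k-1] is the (hypothetical) Lagrange multiplier on x**k.
    Only the highest-order nonzero term matters for large |x|.
    """
    nonzero = [(k, lam) for k, lam in enumerate(lambdas, start=1) if lam != 0]
    if not nonzero:
        return False  # flat density on the real line: not normalizable
    order, leading = nonzero[-1]
    # Even order with a negative coefficient: exponent -> -inf in both tails.
    # Odd order: exponent -> +inf in one tail, so the integral diverges.
    return order % 2 == 0 and leading < 0

print(tails_decay([0.0, -0.5]))               # Gaussian shape -> True
print(tails_decay([0.0, -0.5, 0.1]))          # cubic term added -> False
print(tails_decay([0.0, -0.5, 0.1, -0.01]))   # quartic term added -> True
```

So with a third-moment constraint the exponential-form candidate fails on the real line unless the cubic multiplier happens to be zero. With four moments the form can be normalized, but whether suitable multipliers exist depends on the moment values you specify.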

As for the continuous versus discrete thing, there's some caution that's warranted, but a lot of the maxent principles apply, and there are similar, closely related principles (minimum description length, which has been shown to be equivalent to maximum entropy inferentially in a sense) that generalize in the continuous case. If you think of everything as discretized (as is the case with machine representation), there's some work showing that the discretized and continuous cases are sort of related up to a constant (doi: 10.1109/TIT.2004.836702).
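
A standard textbook way to make "related up to a constant" concrete (not necessarily the exact statement in the paper cited above) is that binning a smooth density into cells of width delta gives a discrete entropy of roughly h(X) - ln(delta), where h is the differential entropy. A minimal sketch for a standard normal, with helper names of my own choosing:

```python
import math

def normal_cdf(x, sigma=1.0):
    """CDF of N(0, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def discretized_entropy(delta, sigma=1.0, span=12.0):
    """Shannon entropy (nats) of N(0, sigma^2) binned into cells of width delta."""
    n_bins = int(round(span * sigma / delta))
    total = 0.0
    for i in range(-n_bins, n_bins):
        p = normal_cdf((i + 1) * delta, sigma) - normal_cdf(i * delta, sigma)
        if p > 0.0:  # far-tail bins underflow to zero and contribute nothing
            total -= p * math.log(p)
    return total

def differential_entropy(sigma=1.0):
    """Differential entropy of N(0, sigma^2) in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

for delta in (0.5, 0.1, 0.02):
    discrete = discretized_entropy(delta)
    predicted = differential_entropy() - math.log(delta)
    print(delta, round(discrete, 4), round(predicted, 4))
```

The two columns should agree more closely as delta shrinks. The -ln(delta) offset is also why the discrete entropy blows up in the continuum limit while the differential entropy stays finite.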

I realize this is a bit hand-wavy but it is a HN post.

Thank you, I really appreciate the response. This was useful.

I do see the reasoning for choosing the normal due to it being the only distribution with a finite number of non-zero cumulants, and thus, as you nicely pointed out, constraints on a finite number of higher order moments will not give a unique distribution.

But, due to the issues we've now mentioned, I find myself a bit uneasy wrt. maxent as a derivation of and/or as an explanation of the ubiquity of the normal distribution. Thus I find myself more comfortable with some of the other derivations demonstrated by Jaynes.

And thank you for the paper reference; will have a proper look at it sometime. It might be related to

I enjoyed reading the chapter but I didn't have enough time to put into it to understand all his derivations as well as I would like. So I may be incorrect here, but I don't think he is proving the Gaussian distribution is correct, just that it is a good (or the best) one to use.

Does someone have a dart board? It would be nice to take a look at some real data. Maybe 20 throws? Or 200?

I don't think the results will fit a Gaussian particularly well. I think there will be more darts at large distances than expected. For that matter I would guess, if there is enough data, that the mean would be slightly below the point of maximum likelihood (as in closer to the floor).

Jaynes usually tries to move away from an interpretation of probability distributions being "correct" as in representing a fact about the world, and towards a definition that is more about a state of knowledge and uncertainty about the world. Distributions are a property of a model or of a knowledgeable agent, not of an object or situation. See Chapter 10: "Physics of 'Random Experiments'". However the two definitions sorta become indistinguishable when you are dealing with experiments that are repeated enough times.

I'm not a mathematician but I think these are valid concerns. I'm only very superficially familiar with the continuous entropy debate. My understanding is that the continuous version is not quite as mathematically ironclad as the discrete version. The concept of continuous entropy seems to be useful for reasoning about things. That's been good enough for me. I don't know if Shannon would approve.

As for the idea of using only the first two moments, to me, that's just based on the very Bayesian idea of reducing the number of parameters you work with in order to make your models more easily learnable and computable. Most of the time, you only have enough data to estimate a limited number of parameters. As you add more parameters, the model gets much more difficult to learn, as well as mathematically and computationally harder to manipulate. You also get diminishing returns in terms of predictive power. "The blessing of abstraction", the reduction in the number of parameters and possible states in models, is the best we have to deal with the "curse of dimensionality".

Or as Yudkowsky puts it:

"Our physics uses the same theory to describe an airplane, and collisions in a particle accelerator - particles and airplanes both obey special relativity and general relativity and quantum electrodynamics and quantum chromodynamics. But we use entirely different models to understand the aerodynamics of a 747 and a collision between gold nuclei. A computer modeling the aerodynamics of the 747 may not contain a single token representing an atom, even though no one denies that the 747 is made of atoms.

A useful model isn't just something you know, as you know that the airplane is made of atoms. A useful model is knowledge you can compute in reasonable time to predict real-world events you know how to observe. Physicists use different models to predict airplanes and particle collisions, not because the two events take place in different universes with different laws of physics, but because it would be too expensive to compute the airplane particle by particle."

However, when you do have enough data, easy enough equations and enough computing power to deal with higher moments, by all means do so!

Most certainly. I did not intend to question the usefulness of maxent models. I just find myself a bit uneasy with maxent as a derivation of and/or explanation of the ubiquity of the normal distribution, given the two issues mentioned above when we're talking about the continuous case. I was wondering if you might have some insight into the issue which could remedy this feeling of uneasiness :)

And regarding the moments, it's just that the normal distribution is the only distribution with a finite number of non-zero cumulants. Therefore, constraining higher order moments is not so straightforward.
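
A small caveat that a quick calculation makes concrete: the even raw moments of a Gaussian are all nonzero (the fourth central moment is 3*sigma^4, for example), so the uniqueness statement is usually phrased in terms of cumulants, which vanish beyond second order for the Gaussian (Marcinkiewicz's theorem covers the "only" part). A sketch using textbook central-moment formulas, with helper names that are mine:

```python
import math

def odd_double_factorial(n):
    """n!! for odd n >= -1, e.g. 5!! = 5 * 3 * 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def gaussian_central_moment(k, sigma):
    """E[(x - mu)^k] for N(mu, sigma^2): 0 for odd k, (k-1)!! * sigma^k for even k."""
    return 0.0 if k % 2 else odd_double_factorial(k - 1) * sigma ** k

def laplace_central_moment(k, b):
    """E[(x - mu)^k] for a Laplace density with scale b: 0 for odd k, k! * b^k for even k."""
    return 0.0 if k % 2 else math.factorial(k) * b ** k

def fourth_cumulant(m2, m4):
    """kappa_4 = mu_4 - 3 * mu_2^2, written in central moments."""
    return m4 - 3 * m2 ** 2

sigma, b = 1.5, 1.5
gauss_m4 = gaussian_central_moment(4, sigma)
gauss_k4 = fourth_cumulant(gaussian_central_moment(2, sigma), gauss_m4)
laplace_k4 = fourth_cumulant(laplace_central_moment(2, b), laplace_central_moment(4, b))

# Gaussian: nonzero fourth moment but zero fourth cumulant; Laplace: nonzero cumulant.
print(gauss_m4, gauss_k4, laplace_k4)
```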

Also might be worth noting that technically, if one was sufficiently UNreasonable, one could constrain the target distribution to take on specific values given specific inputs. This would not be very useful in any real-world applications. Then choosing between constraining only the first moment ("minimal" constraints), or constraining each point you've observed to take on the normalized frequency ("maximal" constraints) becomes entirely up to you. Therefore I don't see quite how maxent models give us the tools for deciding between complexity and accuracy, as the maxent models can be on either end of the spectrum depending on what constraints we choose.

(Unsure if you were implying that it did, but nonetheless it might be something to note.)
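
To illustrate the two ends of that spectrum, here is a hedged sketch on a finite support (a six-sided die, in the spirit of Jaynes' dice example), where the continuous-entropy worries don't arise. With only a mean constraint, the maxent solution has the form p_i proportional to exp(lam * i). Constraining every face to its observed frequency just hands the frequencies back. The frequency table and the target mean of 4.5 are made-up inputs.

```python
import math

FACES = range(1, 7)

def maxent_given_mean(target_mean, lo=-20.0, hi=20.0, iters=200):
    """Maxent distribution on faces 1..6 with a fixed mean, via bisection on lam."""
    def dist(lam):
        weights = [math.exp(lam * i) for i in FACES]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(p):
        return sum(i * pi for i, pi in zip(FACES, p))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The mean is increasing in lam, so bisection applies.
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical observed frequencies; they sum to 1 and have mean 4.5.
observed = [0.05, 0.10, 0.10, 0.15, 0.25, 0.35]

minimal = maxent_given_mean(4.5)  # only the mean is constrained
maximal = list(observed)          # every face constrained: nothing left to maximize

print([round(x, 4) for x in minimal], round(entropy(minimal), 4))
print(maximal, round(entropy(maximal), 4))
```

Both outputs are "maxent" answers for the same data. Which one you get is decided entirely by the constraints you chose to impose, which is the point being made above.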