| I'm not a mathematician, but I think these are valid concerns. I'm only superficially familiar with the continuous entropy debate. My understanding is that the continuous version is not quite as mathematically ironclad as the discrete version. Still, the concept of continuous entropy seems to be useful for reasoning about things, and that's been good enough for me. I don't know if Shannon would approve. As for the idea of using only the first two moments, to me that's just based on the very Bayesian idea of reducing the number of parameters you work with in order to make your models more easily learnable and computable. Most of the time, you only have enough data to estimate a limited number of parameters. As you add more parameters, the model gets much harder to learn, and harder to manipulate mathematically and computationally. You also get diminishing returns in terms of predictive power. "The blessing of abstraction", a reduction in the number of parameters and possible states in models, is the best we have to deal with the "curse of dimensionality". Or as Yudkowsky puts it: "Our physics uses the same theory to describe an airplane, and collisions in a particle accelerator - particles and airplanes both obey special relativity and general relativity and quantum electrodynamics and quantum chromodynamics. But we use entirely different models to understand the aerodynamics of a 747 and a collision between gold nuclei. A computer modeling the aerodynamics of the 747 may not contain a single token representing an atom, even though no one denies that the 747 is made of atoms. A useful model isn't just something you know, as you know that the airplane is made of atoms. A useful model is knowledge you can compute in reasonable time to predict real-world events you know how to observe.
Physicists use different models to predict airplanes and particle collisions, not because the two events take place in different universes with different laws of physics, but because it would be too expensive to compute the airplane particle by particle." However, when you do have enough data, easy enough equations, and enough computing power to deal with higher moments, by all means do so! |
And regarding the moments, it's just that the normal distribution is the only (non-degenerate) distribution with a finite number of non-zero cumulants. Its higher moments do not vanish, but they are all determined by the first two. Therefore, constraining higher-order moments is not so straightforward.
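To make the point about the normal distribution concrete, here is a small sketch in terms of cumulants, which are the quantities that actually vanish beyond second order for a Gaussian (its even central moments do not). It uses the standard recursion between raw moments and cumulants and compares a normal distribution with an exponential one. The function names and the example parameters (mean 1, variance 4, unit-rate exponential) are my own choices for illustration.

```python
from math import comb, factorial


def cumulants_from_raw_moments(m):
    """Convert raw moments m[n] = E[X^n] (with m[0] == 1) to cumulants.

    Uses the recursion
        kappa_n = m_n - sum_{k=1}^{n-1} C(n-1, k-1) * kappa_k * m_{n-k}.
    Returns a list where kappa[n] is the n-th cumulant (kappa[0] is unused).
    """
    kappa = [0] * len(m)
    for n in range(1, len(m)):
        kappa[n] = m[n] - sum(
            comb(n - 1, k - 1) * kappa[k] * m[n - k] for k in range(1, n)
        )
    return kappa


def normal_raw_moments(mu, var, order):
    """Raw moments of N(mu, var) via m_n = mu*m_{n-1} + (n-1)*var*m_{n-2}."""
    m = [1, mu]
    for n in range(2, order + 1):
        m.append(mu * m[n - 1] + (n - 1) * var * m[n - 2])
    return m[: order + 1]


ORDER = 8

# Normal with mean 1 and variance 4: only the first two cumulants are non-zero.
normal_k = cumulants_from_raw_moments(normal_raw_moments(1, 4, ORDER))

# Exponential with rate 1 has raw moments n!, and every cumulant is non-zero.
exp_k = cumulants_from_raw_moments([factorial(n) for n in range(ORDER + 1)])

print("normal cumulants:     ", normal_k[1:])
print("exponential cumulants:", exp_k[1:])
```

For the normal case everything past the second entry comes out as zero, while the exponential keeps producing non-zero cumulants at every order, so there is no natural place to truncate.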
It might also be worth noting that technically, if one were sufficiently UNreasonable, one could constrain the target distribution to take on specific values given specific inputs. This would not be very useful in any real-world application. But then the choice between constraining only the first moment ("minimal" constraints) and constraining each observed point to take on its normalized frequency ("maximal" constraints) is entirely up to you. So I don't quite see how maxent models give us the tools for trading off complexity against accuracy, as a maxent model can sit at either end of the spectrum depending on which constraints we choose.
(Unsure if you were implying that it did, but nonetheless it might be something to note.)
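A rough sketch of the two extremes, assuming a toy dataset of die rolls that I made up for illustration. With only the mean constrained, the maxent distribution on a finite support has the form p_i ∝ exp(λ x_i), and λ can be found by bisection. With every point's probability constrained to its observed frequency, the "maxent" answer is just the empirical distribution. The helper names (`maxent_given_mean`, `entropy`) are hypothetical, not from any library.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def maxent_given_mean(support, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maxent pmf on `support` subject only to a fixed mean.

    The solution has the form p_i proportional to exp(lam * x_i); the mean is
    increasing in lam, so lam is located by bisection on [lo, hi].
    """
    def pmf(lam):
        exps = [lam * x for x in support]
        top = max(exps)  # subtract the max exponent for numerical stability
        w = [math.exp(e - top) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(x * p for x, p in zip(support, pmf(lam)))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return pmf((lo + hi) / 2)


# Made-up observations of a six-sided die (an assumption for this example).
data = [1, 1, 2, 6, 6, 6, 3, 6, 5, 6]
support = [1, 2, 3, 4, 5, 6]
data_mean = sum(data) / len(data)

# "Minimal" constraints: match only the first moment.
minimal = maxent_given_mean(support, data_mean)

# "Maximal" constraints: match every observed frequency exactly.
counts = Counter(data)
maximal = [counts[x] / len(data) for x in support]

print("minimal:", [round(p, 3) for p in minimal], "H =", round(entropy(minimal), 3))
print("maximal:", [round(p, 3) for p in maximal], "H =", round(entropy(maximal), 3))
```

Both results are maximum-entropy distributions under their respective constraints. The first smooths over the data and assigns mass to the unobserved face, while the second reproduces the sample exactly, so the maxent principle by itself does not say which set of constraints to pick.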