by palmy
2800 days ago
Thank you for the link to Jaynes' book! Really nice to see the different approaches. I'm intrigued by your comment on maximum entropy, as I personally struggle with the maximum entropy derivation because it uses differential ("continuous") entropy to derive the Gaussian under constraints on the first and second moments. Differential entropy does not satisfy the same properties as entropy for a discrete distribution, including some of the very properties that motivated entropy as a measure of information. Jaynes himself wrote a paper on this topic of continuous entropy in the 60s (can dig out the reference in the morning).
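To make the concern concrete, here is a small worked sketch of two standard properties that discrete entropy has and differential entropy lacks. The symbols $h$, $H$, $a$ and the uniform example are illustration choices made here, not something taken from Jaynes' paper.

```latex
% Differential entropy of a density p on the real line
h(X) = -\int p(x) \log p(x) \, dx

% 1) It is not invariant under a change of units: for Y = aX with a \neq 0,
h(aX) = h(X) + \log |a|

% 2) It can be negative: for X uniform on [0, a],
h(X) = \log a < 0 \quad \text{whenever } 0 < a < 1

% By contrast, for a discrete distribution with probabilities p_i,
H(X) = -\sum_i p_i \log p_i \;\ge\; 0,
% and H is unchanged by any one-to-one relabeling of the outcomes.
```

So the value of $h$ depends on the coordinate system in which the density is written, which is one way of stating why it does not inherit the "measure of information" interpretation automatically.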
Even ignoring this, I also struggle a bit with "we're only constraining the first and second moments". Why exactly the first two? Why not the first three, etc.? One could say it's motivated by the fact that the Gaussian is the only distribution with finitely many nonzero cumulants, but that seems a bit hand-wavy? Would genuinely appreciate some input here, as the Principle of Maximum Entropy is something I have a bit of trouble coming to terms with for the reasons described above (in general, mainly because the choice of constraints is arbitrary).
|
As for the moment issue, the short story is that once you constrain three or four moments, there is no longer a general maximum entropy distribution, except for some special idiosyncratic cases with three, I think. So the normal is, in some ways, the most conservative distribution you can have in a general, unspecified scenario. You can specify more moments, but then there isn't a single maxent distribution that applies across all third- and fourth-moment scenarios the way the normal does for the first two.
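A rough sketch of why the third moment causes trouble, using the usual Lagrange-multiplier form of the maxent solution. The multipliers $\lambda_k$ and targets $m_k$ are notation introduced here for illustration; this is the textbook argument, not a full treatment of existence.

```latex
% Stationary form of the entropy functional under constraints
% E[x^k] = m_k for k = 0, ..., n (k = 0 is normalization):
p(x) = \exp\!\Big( -\sum_{k=0}^{n} \lambda_k x^k \Big), \qquad x \in \mathbb{R}

% n = 2: normalizable exactly when \lambda_2 > 0; completing the square gives
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\!\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big)

% n = 3: the cubic term dominates in one tail, so
\int_{-\infty}^{\infty}
  \exp\!\big( -\lambda_0 - \lambda_1 x - \lambda_2 x^2 - \lambda_3 x^3 \big)\, dx
  = \infty \quad \text{whenever } \lambda_3 \neq 0
```

The only normalizable candidate with three moments therefore has $\lambda_3 = 0$, which is a Gaussian again, and a Gaussian has zero third central moment. If the prescribed third moment is anything else, the entropy can be pushed arbitrarily close to the Gaussian value but no density attains it. With four moments and $\lambda_4 > 0$ the form is at least normalizable; whether multipliers matching a given set of moments exist is a separate question.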
As for continuous versus discrete, some caution is warranted, but a lot of the maxent principles still apply, and there are closely related principles (minimum description length, which has been shown to be inferentially equivalent to maximum entropy in a sense) that generalize to the continuous case. If you think of everything as discretized (as is the case with machine representation), there's some work showing that the discretized and continuous cases are related up to a constant (doi: 10.1109/TIT.2004.836702).
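One way to see the "related up to a constant" point numerically is the standard textbook relation H(X_delta) ≈ h(X) - log(delta) for a variable binned at width delta. The sketch below checks it for a standard normal using only the Python standard library. The function names, the bin widths, and the 8-sigma truncation are assumptions made for this illustration, and the relation shown is the generic one, not necessarily the exact result of the paper cited above.

```python
import math


def gaussian_cdf(x, sigma=1.0):
    """CDF of a zero-mean normal with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def discretized_entropy(delta, sigma=1.0, span=8.0):
    """Shannon entropy (nats) of N(0, sigma^2) binned into cells of width delta.

    Bins cover [-span*sigma, span*sigma]; the mass outside is negligible.
    """
    n_bins = int(round(2.0 * span * sigma / delta))
    left = -span * sigma
    entropy = 0.0
    for i in range(n_bins):
        a = left + i * delta
        p = gaussian_cdf(a + delta, sigma) - gaussian_cdf(a, sigma)
        if p > 0.0:  # skip empty bins so log(p) is defined
            entropy -= p * math.log(p)
    return entropy


def gaussian_differential_entropy(sigma=1.0):
    """Closed form: 0.5 * log(2 * pi * e * sigma^2), in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


h = gaussian_differential_entropy()
for delta in (0.5, 0.1, 0.01):
    H = discretized_entropy(delta)
    # H grows without bound as delta shrinks, but H + log(delta) settles near h.
    print(f"delta={delta}: H={H:.4f}  H+log(delta)={H + math.log(delta):.4f}  h={h:.4f}")
```

The discrete entropy itself diverges as the bins shrink; only after adding log(delta), which depends on the grid and not on the distribution, does it approach the differential entropy. That offset is the same for every candidate distribution on a fixed grid, so it does not change which distribution maximizes entropy.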
I realize this is a bit hand-wavy, but it is an HN post.