by happy4crazy
2929 days ago
I personally don't find the "bits"-based explanation of entropy/cross-entropy/KL etc. all that intuitive; fundamental as it may be, I don't think about compression or encodings very often. I've always preferred the "surprise" interpretation: http://charlesfrye.github.io/stats/2016/03/29/info-theory-su... In short: given an event of probability p, its "surprise" is -log p = log 1/p. (If p = 1, log 1/1 = 0, so zero surprise; as p -> 0, the surprise grows without bound; and for two independent events with joint probability p = p1 * p2, the surprise is the sum of their individual surprises: log 1/(p1*p2) = log 1/p1 + log 1/p2.) The entropy of a distribution is its average surprise: Sum/Integral of p log 1/p. KL(p || q) is your excess surprise if you think something's distribution is q but it's actually p: Sum/Integral of p (log 1/q - log 1/p). The KL divergence is always non-negative because, intuitively, if you think the distribution is q but it's actually p, on average you're going to be at least as surprised as someone who knows it's p.
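For anyone who wants to poke at the definitions above, here is a minimal Python sketch of them for discrete distributions. The function names and the example distributions p and q are my own illustrative choices, not from the linked post; it uses natural logs (so surprise is in nats rather than bits) and assumes q is non-zero wherever p is.

```python
import math

def surprise(p):
    """Surprise of an event with probability p, in nats: log 1/p."""
    return math.log(1.0 / p)

def entropy(p):
    """Average surprise under p, when you correctly believe p."""
    return sum(pi * surprise(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprise when outcomes follow p but you believe q.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * surprise(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Excess surprise: Sum p (log 1/q - log 1/p)."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]   # the "actual" distribution
q = [0.8, 0.1, 0.1]     # what you mistakenly believe

print("entropy(p)       =", entropy(p))
print("cross_entropy    =", cross_entropy(p, q))
print("KL(p || q)       =", kl(p, q))
print("KL(p || p)       =", kl(p, p))
```

Swapping math.log for math.log2 would give the same quantities measured in bits; only the unit changes, not the interpretation.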
Bits are fundamental to understanding why we can encode simple numbers in a GNN. If you don't understand that, then surprise, surprise: you need to create another framework, one that may mislead you further down the line.