by happy4crazy 2929 days ago
I personally don't find the "bits"y explanation of entropy/cross-entropy/KL etc. to be all that intuitive; as fundamental as it may be, I don't think about compression/encodings all that often. I've always preferred the "surprise" interpretation: http://charlesfrye.github.io/stats/2016/03/29/info-theory-su...

In short: given some event of probability p, -log p = log 1/p is its "surprise". (If p = 1, log 1/1 = 0, so zero surprise; as p -> 0, the surprise gets bigger and bigger; and the surprise for two independent events, p = p1 * p2, is the sum of their individual surprises: log 1/(p1*p2) = log 1/p1 + log 1/p2.)
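For anyone who wants to poke at the numbers, here is a minimal Python sketch of those three properties. The function name `surprise` and the choice of log base 2 (so the unit is bits) are my own assumptions; any base works.

```python
import math

def surprise(p: float) -> float:
    """Surprise of an event with probability p, measured in bits (log base 2)."""
    return math.log2(1.0 / p)

# A certain event carries no surprise.
certain = surprise(1.0)

# Rarer events are more surprising.
rare = surprise(0.001)
common = surprise(0.5)

# For independent events, surprises add: log 1/(p1*p2) = log 1/p1 + log 1/p2.
p1, p2 = 0.5, 0.25
joint = surprise(p1 * p2)
summed = surprise(p1) + surprise(p2)

print(certain, common, rare)
print(joint, summed)
```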

The entropy of a distribution is its average surprise: Sum/Integral of p log 1/p.
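As a rough sketch for the discrete case (the Sum, not the Integral), assuming the distribution is just a list of probabilities that sum to 1. The name `entropy` and the example distributions are made up for illustration.

```python
import math

def entropy(dist):
    """Average surprise, in bits, of a discrete distribution.

    dist is a list of probabilities summing to 1. Outcomes with p = 0
    are skipped, since p * log(1/p) tends to 0 as p -> 0.
    """
    return sum(p * math.log2(1.0 / p) for p in dist if p > 0)

fair_coin = [0.5, 0.5]      # maximally unpredictable for two outcomes
biased_coin = [0.9, 0.1]    # usually heads, so less surprising on average
sure_thing = [1.0]          # never surprising

print(entropy(fair_coin))
print(entropy(biased_coin))
print(entropy(sure_thing))
```

The fair coin averages one bit of surprise per flip, the biased coin less than that, and the sure thing none.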

KL(p || q) is your excess surprise if you think something's distribution is q but it's actually p: Sum/Integral p (log 1/q - log 1/p). The KL divergence is always non-negative because surely if you think the distribution is q but it's actually p, on average you're going to be more surprised than someone who knows it's p.
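A small sketch of that formula for discrete distributions, again in bits. The name `kl` and the example distributions are hypothetical, and the code assumes q is nonzero wherever p is nonzero (otherwise the excess surprise is infinite).

```python
import math

def kl(p, q):
    """Excess average surprise, in bits, from believing q when the truth is p.

    p and q are lists of probabilities over the same outcomes.
    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(
        pi * (math.log2(1.0 / qi) - math.log2(1.0 / pi))
        for pi, qi in zip(p, q)
        if pi > 0
    )

truth = [0.9, 0.1]    # the actual distribution
belief = [0.5, 0.5]   # what you mistakenly think it is

print(kl(truth, belief))   # positive: the wrong belief costs extra surprise
print(kl(truth, truth))    # zero: the right belief costs no extra surprise
print(kl(belief, truth))   # a different number: KL is not symmetric
```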

1 comments

If you're introducing a new term solely for the sake of explaining something, then your fundamentals are wrong.

Bits are fundamental to understanding why we can encode simple numbers in a GNN. If you don't understand that, then, surprise surprise, you need to create another framework, one that may mislead you further down the line.