by mreid 1025 days ago
This likelihood ratio approach highlights the fact that KL divergence is a member of the family of Csiszár F-divergences. These are measures of "distance" between distributions of the form E_Q[ F(p(x)/q(x)) ] where F is any convex function with F(1) = 0. This is kind of a generalization of log-likelihood where F "weights" the badness of ratios different from 1. When F is -log you get the KL divergence KL(Q || P); taking F(t) = t log t instead gives KL(P || Q).
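To make the F-divergence form concrete, here is a minimal sketch for discrete distributions stored as plain lists with strictly positive entries. The names `f_divergence` and `kl` and the example distributions are my own illustration, not from any library. It checks numerically that F = -log reproduces KL(Q || P) and that F(t) = t log t reproduces KL(P || Q).

```python
import math

def f_divergence(p, q, f):
    """E_Q[ f(p(x)/q(x)) ] for discrete p, q (assumes every q(x) > 0)."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl(p, q):
    """KL(P || Q) = sum_x p(x) log(p(x)/q(x)) (assumes strictly positive entries)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# F = -log: the expectation under Q of -log(p/q) is KL(Q || P).
reverse_kl = f_divergence(p, q, lambda t: -math.log(t))

# F(t) = t log t: the expectation under Q of (p/q) log(p/q) is KL(P || Q).
forward_kl = f_divergence(p, q, lambda t: t * math.log(t))

print(reverse_kl, kl(q, p))
print(forward_kl, kl(p, q))
```

Both choices of F are convex with F(1) = 0, so the divergence of a distribution from itself is zero in either direction.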

Another curious fact about KL divergence is that it is also a Bregman divergence: take a convex function H and define B_H(P, Q) = H(P) - H(Q) - <∇H(Q), P - Q>. These generalize squared Euclidean distance, which you get from H(P) = \sum_x p(x)^2. KL is obtained when H(P) is negative entropy \sum_x p(x) log p(x).
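A quick numerical check of the Bregman view, as a sketch under the assumption that H is separable, i.e. H(P) = \sum_x h(p(x)), so the divergence can be summed pointwise. The helper names (`bregman`, `kl`) and the example distributions are hypothetical. With h(t) = t log t each term works out to p log(p/q) - p + q, which sums to KL(P || Q) when both distributions sum to 1. With h(t) = t^2 each term is (p - q)^2.

```python
import math

def bregman(p, q, h, dh):
    """Separable Bregman divergence: sum_x h(p) - h(q) - h'(q) * (p - q)."""
    return sum(h(px) - h(qx) - dh(qx) * (px - qx) for px, qx in zip(p, q))

def kl(p, q):
    """KL(P || Q) = sum_x p(x) log(p(x)/q(x)) (assumes strictly positive entries)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# Negative entropy, pointwise: h(t) = t log t, h'(t) = log t + 1.
b_kl = bregman(p, q, lambda t: t * math.log(t), lambda t: math.log(t) + 1.0)

# Squared norm, pointwise: h(t) = t^2, h'(t) = 2t.
b_sq = bregman(p, q, lambda t: t * t, lambda t: 2.0 * t)

print(b_kl, kl(p, q))
print(b_sq, sum((px - qx) ** 2 for px, qx in zip(p, q)))
```

If p and q are not normalized, the negative-entropy case gives the generalized KL divergence, which keeps the extra - \sum_x p(x) + \sum_x q(x) term.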

I spent a bunch of time studying divergences over distributions (e.g., see my blog post[1]), and in particular these two classes. The really neat fact about KL divergence is that it is essentially the only divergence that is both an F-divergence and a Bregman divergence. This is basically due to the property of log that turns logs of products into sums of logs.

[1]: https://mark.reid.name/blog/meet-the-bregman-divergences.htm...