Thanks for sharing! Even though Shannon's "A Mathematical Theory of Communication" is so accessible, I find that most people in our field (stats/ML) don't often think through information-theoretic tools from first principles.
Yes, KL divergences show up everywhere, but they are not derived from scratch often enough. Maybe I'm stifled by my campus bubble though :)
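
For what it's worth, here is a minimal sketch of one from-scratch reading of KL divergence: the expected number of extra bits you pay when you encode samples from p with a code that is optimal for q. The function names and the two toy distributions are made up for illustration, and the code assumes discrete distributions where q is nonzero wherever p is nonzero.

```python
import math

def entropy(p):
    """Expected code length in bits under the optimal code for p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length in bits when samples come from p
    but the code is optimal for q. Assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) computed directly from its definition, in bits.
    Same assumption: q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy distributions over three symbols.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

extra_bits = cross_entropy(p, q) - entropy(p)
print(entropy(p), cross_entropy(p, q), extra_bits, kl_divergence(p, q))
```

The point is only that the "extra bits" quantity and the usual sum-of-p-log-ratio definition agree, which is easy to forget when KL is introduced as a bare formula.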
Here's a quote of my own tweet, which was a comment on a schema:BlogPost: https://twitter.com/westurner/status/1048125281146421249
> “When Bayes, Ockham, and Shannon come together to define machine learning” https://towardsdatascience.com/when-bayes-ockham-and-shannon...
> Comment: "How does this relate to the Principle of Maximum Entropy? How does Minimum Description Length relate to Kolmogorov Complexity?"