I disagree that you need a solid grounding in information theory. Almost all that I've seen about IT in ML is minimizing the KL divergence, which can be learned by browsing the wiki page.
Well, information theory isn't much more than the logarithm of probability theory, so it doesn't hurt to learn it anyway. The only thing you need to know is that given a probability distribution P there exists a compression scheme to encode a value X with a message of P_length(X) = log(1/P(X)) bits (log base 2). This can be summarised as BITS = log(1/PROBABILITY). Entropy is just the average number of bits you need to encode a random value from distribution P with the compression scheme of distribution P, i.e. E_P[P_length(X)]. The KL(P,Q) divergence comes up when you encode a random value from distribution P with the compression scheme of distribution Q. Say you're compressing English text but you're using a compressor tailored to Spanish. The KL divergence is how many extra bits you need (on average) compared to encoding the English text with the English compressor:
Maybe it's more that what is essential for a molecular biologist isn't necessary for a general practitioner? It's just... those conference calls where you're explaining that just because the classifier is working really well now doesn't mean that we can use it in production, those calls can get difficult and annoying, and sometimes the "other side" wins - with predictable results.
You bring up a very important and difficult point: if the decision making is in the hands of someone who doesn't understand the nuances well and has neither the time nor the inclination to learn them, what do you do?
If your salary is going to depend on how many models you pushed out and not how well they continued to perform, many will optimize over the number of models pushed out.
A major source of problems (and sometimes a gift) is that you cannot prove an empirical statistical claim true or false in finite time. There is always a non-zero probability that the weirdest thing will happen. It could be just sheer bad luck that the model did so poorly in this cycle.
That's not because you need little background in information theory. That's because KL-divergences are such a universal info-theoretic quantity that if you deeply understand them, you understand much to most of information theory.
This is like saying, "You don't need to really know calculus, just integrals."
KL(P,Q) = E_P[Q_length(X)] - E_P[P_length(X)]
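If it helps to see the numbers, here's a rough Python sketch of the bits-per-value picture behind that formula. The two toy distributions and the function names (`code_length`, `expected_length`, `entropy`, `kl`) are made up for illustration, and it assumes log base 2 so the lengths come out in bits.

```python
import math

def code_length(p):
    """Bits needed for an outcome of probability p: log2(1/p)."""
    return math.log2(1.0 / p)

def expected_length(source, coder):
    """Average bits when values drawn from `source` are encoded
    with the compression scheme built for `coder`."""
    return sum(ps * code_length(coder[x])
               for x, ps in source.items() if ps > 0)

def entropy(p):
    # E_P[P_length(X)]: P's own scheme applied to P's own values.
    return expected_length(p, p)

def kl(p, q):
    # E_P[Q_length(X)] - E_P[P_length(X)]: the extra bits paid
    # for using Q's scheme on values that actually come from P.
    return expected_length(p, q) - expected_length(p, p)

# Hypothetical toy distributions over three symbols.
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(entropy(P))             # 1.5 bits
print(expected_length(P, Q))  # 1.75 bits
print(kl(P, Q))               # 0.25 extra bits on average
```

Note that kl(P, P) is 0 and kl(P, Q) is generally not equal to kl(Q, P), which matches the "wrong compressor" intuition: the penalty depends on which language you're actually compressing.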