Hacker News
by ethanweinberger 2164 days ago
Hi HN, I'm the author of this piece (Ethan Weinberger). I wrote this originally as a set of notes for myself when brushing up on concepts in information theory the past couple of weeks. I found the presentations I was reading of the material to be a little dry for my taste, so I tried to incorporate more visuals and really emphasize the intuition behind the concepts. Glad to see others are finding it useful/interesting! :)
5 comments

Thanks, I enjoyed reading. As an electronic engineering student, I remember grappling with information theory in the abstract: it was a weather example very similar to yours that gave me the intuition I was missing.

An observation/suggestion. The intro is accessible to many people; that drops off a steep cliff when you hit the maths. Now, I'm not complaining about that: it's instructive and necessary to formalise things. Where I struggle is in reading the equations in my head when I don't know what words to use for the symbols. For example, that very first `X ~ p(x)`. I didn't know what to say for the tilde character, so couldn't verbalise the statement. I do know that $\in$ (the rounded 'E') means 'is a member of' so I could read the next statement. The problem gets even more confusing for a non-mathematician as the same symbol is used with different meaning in different branches of maths/science (e.g. $\Pi$).

I get that writing out every equation in English isn't feasible (or, at least, is asking a lot of the writer). But I wonder if there's a middle way, e.g. through hyperlinking?

As I say: not a criticism and I don't have a good solution. Just an observation from a non-mathematician. Enjoyed the piece anyway.

"X ~ p(x)" means "X is a random variable drawn from the probability distribution p(x)" or maybe "X is drawn from p(x)" for short.

Are you sure it's a matter of knowing what to say (in your head) vs knowing the definition of the notation in the first place? I am pretty familiar with this notation, but I rarely verbalize it mentally. I can tell because I read and understand it quickly without problem, but on the rare occasion when I have to read it aloud I realize I'm not sure how I should pronounce it.
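
To make "drawn from" concrete, here is a minimal Python sketch. The four outcomes and their probabilities are made-up illustrative values, and `draw` is a hypothetical helper name, not anything from the article:

```python
import random
from collections import Counter

# Hypothetical distribution p(x) over four outcomes (illustrative values only).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def draw(p, rng):
    """Return one realization of the random variable X ~ p(x)."""
    outcomes = list(p)
    weights = [p[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [draw(p, rng) for _ in range(10_000)]

# Empirical frequencies of each outcome; these should land close to p.
freq = {x: n / len(samples) for x, n in Counter(samples).items()}
print(freq)
```

Each call to `draw` is one "X"; the distribution `p` only shows up once you look at many draws.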

Thanks for the explanation.

Agree it's more "say in my head" than "speak out loud". But I still need to know what to say - internally or externally. Without knowing that ~ denotes "drawn from", all I can say is "X tilde p of x". That has no semantics; no intuition. Whereas knowing that $\in$ means "is a member of", I can read "x \in X" as "x is a member of X".

> but I rarely verbalize it mentally

Neither do I when I know something well. For example, I don't explicitly verbalise "is a member of" now, even internally. There's a shortcut hard-wired in that understands it without needing to pronounce it explicitly. In fact that shortcut goes beyond the syntax: it goes straight to the intuition of "x represents any member of the set X". But I had to go through the process of saying it on the way to the shortcut.

OK, but if you know the formal definition, and you're not reading it out loud, why not just make something up? I actually don't know whether "is drawn from" is the "correct" way to pronounce the tilde. I think maybe other people say "is distributed as".
I don't think a piece on information theory should necessarily be "accessible to many people". It's a topic which is normally taught in grad school.

Something like X ~ p(x) would be seen all over the place in probability, stats, ML, and related courses such as info theory, detection and estimation, etc. Likely by the time someone is interested in info theory this notation would be permanently etched into their minds. So for this article it is very "audience appropriate".

> not a criticism and I don't have a good solution

Having a mental map of how different subjects fit together (without actually having to study them in depth) is a good start.

I've seen so many people crash and burn with machine learning because they were unaware that it depends on linear algebra, calculus, and probability.

With a mental map there is less "surprise" and it's more a matter of simply understanding that they didn't have the right dependencies.

I read X ~ p(x) as "X is distributed as p of x"
What would X ~ p(y) mean?
X is distributed as p(y) where p is a probability distribution parameterized by y
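
One way to picture that reading, as a rough Python sketch: assume (purely for illustration) that `p(y)` is a Bernoulli family whose parameter `y` is the probability of drawing a 1. The names `p`, `draw`, and `sample_mean` are hypothetical.

```python
import random

def p(y):
    """Hypothetical family: Bernoulli distribution with parameter y."""
    return {1: y, 0: 1.0 - y}

def draw(dist, rng):
    """Return one realization of X drawn from the given distribution."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def sample_mean(y, n, rng):
    """Average of n draws of X ~ p(y); for a Bernoulli this estimates y."""
    return sum(draw(p(y), rng) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
for y in (0.1, 0.9):
    print(y, sample_mean(y, 10_000, rng))  # fraction of ones tracks y
```

Changing `y` changes which distribution X is drawn from, which is all "parameterized by y" is meant to convey here.
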
Ethan also writes about machine learning at https://honestyisbest.com/kernels-of-truth each week -- his most recent piece there (https://honestyisbest.com/kernels-of-truth/2020/Jul/14/facia...) has a neat explanation of how convolutional neural networks (CNNs) work.
You might want to give more conditions for the claim that the self-information gives the length of the shortest possible code.

In particular the condition isn't only that it's the shortest code, it's the shortest self-delimiting code. In your example with probabilities {1/2, 1/4, 1/8, 1/8}, someone could come in and say let's code it as {0,1,01,00}, which would appear to encode the latter two outcomes in 2 bits rather than 3. The problem, of course, is that {0,1,01,00} is not a self-delimiting code: after you receive the bit 0, you don't know if you're done or if you should wait for another bit to form 01. But the code {0,10,110,111} is self-delimiting, because after you get a 0 or a 10, etc., you know you're done.

I've found that when I teach this material, if I don't mention the self-delimiting condition, then a clever student in the class always thinks of the {0,1,01,00}-type code. (This can be a good way to identify clever students in an intro information theory class!)
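
For anyone who wants to check the two codes above, here is a small Python sketch. It treats "self-delimiting" as "no codeword is a prefix of another" (the usual prefix-free condition); the function names are my own.

```python
import math

def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

def expected_length(probs, codewords):
    """Average number of bits per outcome under the given code."""
    return sum(p * len(c) for p, c in zip(probs, codewords))

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]
short_code = ["0", "1", "01", "00"]
prefix_code = ["0", "10", "110", "111"]

print(is_prefix_free(short_code))           # False: "0" is a prefix of "01" and "00"
print(is_prefix_free(prefix_code))          # True
print(expected_length(probs, short_code))   # 1.25
print(expected_length(probs, prefix_code))  # 1.75
print(entropy_bits(probs))                  # 1.75
```

The non-self-delimiting code averages 1.25 bits, which is below the 1.75-bit entropy, so without the condition the "shortest possible code" claim would look false. The self-delimiting code meets the entropy exactly for this distribution.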

Thank you for the great article. I believe there is a typo in "we assign a value of 0 to p(x) log p(x) when x=0", it should be "when p(x) = 0".
Thanks for the catch. Fixed!
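
For context on why that convention is natural: $p \log p$ is undefined at $p = 0$, but its limit as $p$ approaches 0 from above is 0. A quick derivation using L'Hôpital's rule (my own addition, not quoted from the article):

```latex
\lim_{p \to 0^+} p \log p
  = \lim_{p \to 0^+} \frac{\log p}{1/p}
  = \lim_{p \to 0^+} \frac{1/p}{-1/p^2}
  = \lim_{p \to 0^+} (-p)
  = 0
```

So defining $p(x) \log p(x) = 0$ when $p(x) = 0$ just extends the function continuously, and outcomes that never occur contribute nothing to the entropy.
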
Awesome paper Ethan!!!