Hacker News
by dopu 2157 days ago
Is it just me, or does probability theory in general have fairly terrible notation? Ambiguity between random variables and their distributions because of them simply being distinguished by being upper-case or lower-case, writing likelihood functions alternatively with an L() or p(), and using p() (with different arguments) to refer to different probability distributions. Perhaps I'm just having such a difficult time grokking probability theory because it's just difficult stuff, but I often find myself immensely frustrated with the notation.
7 comments

Probability notation used in ML and engineering has this problem, of overloading p(). Probability notation as used by probabilists in maths departments is completely different: it’s more explicit, and sometimes more clunky.

There’s a hybrid notation that I prefer, for example “Pr_X(x)” for the density function of random variable X at point x; you drop X if the random variable is clear from the context, and you drop x if you’re referring to the entire distribution. Or Pr_X(x|Y=y) for a conditional density. But this notation still has problems when you’re working with hairier conditional distributions, or with distributions that are neither discrete nor continuous.

(Source: used to be a mathematical probabilist, now working in ML.)

I used to hate the way Bayesian ML people used p(...), until I realised that strictly speaking for a conditional variable we ought to be writing: p_{X|Y=y}(x). The variable is X|Y=y, so all of that ought to be in the subscript.

It's definitely worthwhile for everyone to use the full notation at least once so they can get a feel for what's really going on. I've spoken to Bayesian ML professionals who are especially uncomfortable with that because it conditions on a zero-probability event (if Y is continuous)... of course p(x|y) does too, they just weren't thinking about it before! And (as I think you're getting at) the abbreviated p(x|y) simply throws away information, e.g. there's no way to represent the identity p_Y(x)=p_X(x) without adding back some sort of subscript.
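
For anyone who wants to see the two styles side by side, here is a rough sketch of the usual definition of a conditional density, first with full subscripts and then in the abbreviated form. It assumes X and Y have a joint density p_{X,Y} and that p_Y(y) > 0; the symbols are the standard textbook ones, not anything specific to this thread.

```latex
% Full notation: each density is named by the random variable it belongs to.
\[
  p_{X \mid Y = y}(x) \;=\; \frac{p_{X,Y}(x, y)}{p_Y(y)}, \qquad p_Y(y) > 0 .
\]
% Abbreviated notation: three different functions, all called p,
% told apart only by the names of their arguments.
\[
  p(x \mid y) \;=\; \frac{p(x, y)}{p(y)} .
\]
```

In the second line, renaming the argument changes which function is meant, which is why an identity like p_Y(x)=p_X(x) cannot be written without subscripts.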

But on the other hand p(x|y) is obviously much cleaner visually. If you're writing out a more complex identity and the abbreviated notation isn't ambiguous, then it generally communicates the idea much more clearly because there's so much less visual noise.

It is the difficulty of the theory in my experience.

I've had / have trouble and misconceptions many times, but when discussing with someone fluent in the notation and the field, 100% of the time things cleared up, and I couldn't think of a better notation to use.

I think this is partially because it's applicable to so many different domains (cryptography, statistics, etc.) that each have their own notational quirks. Avoiding collisions between notation in the application domain is more important than preserving some "consistent" probability notation (similar to the examples given with dot products). But the issue isn't even limited to notation: Chebyshev has at least 9 valid ways to spell his name (more if you include non-Latin alphabets).

The biggest issue I've encountered is borrowing lecture slides from different universities/lecturers and the notation changes between slides on the same topic (even small things like using square brackets instead of parentheses).

I don't really know, I'm not an expert in the field. What I do know, is that I can usually get almost anything written in English in a statistics text, but when it goes mathematical notation I really struggle even with the simplest of concepts.

Type theory is a little like that, at least from some authors who (although the area is well suited to symbols) kind of go a little off the rails, using seven different kinds of arrows and symbols from at least three different alphabets, instead of maybe just writing "covariant", or even an abbreviated form adjacent to the arrow?

It's a real mess. I ran into an issue recently because I'm dealing with probability distributions in terms of several sets of general curvilinear coordinate systems. In a context like this, the usual abuse of notation in which the function is identified by which arguments go into it just doesn't work. I have a probability density in Cartesian coordinates expressed as a function of (say) elliptical coordinates -- which differs by the Jacobian from the probability density in elliptical coordinates.
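
A small numerical sketch of the Jacobian point, under some loud assumptions: it uses polar coordinates as a simpler stand-in for elliptical ones, a standard 2D Gaussian as the density, and hypothetical function names (`p_xy`, `p_xy_at_polar`, `p_rtheta`) chosen just for illustration. The Cartesian density evaluated at (r, theta) and the density with respect to dr dtheta are different functions, and only the second integrates to 1 over (r, theta).

```python
import math


def p_xy(x, y):
    """Density of a standard 2D Gaussian with respect to dx dy."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)


def p_xy_at_polar(r, theta):
    """The same Cartesian density, merely evaluated at the point (r, theta)."""
    return p_xy(r * math.cos(theta), r * math.sin(theta))


def p_rtheta(r, theta):
    """Density with respect to dr dtheta: Cartesian density times |det J| = r."""
    return p_xy_at_polar(r, theta) * r


def integrate_polar(f, r_max=10.0, n_r=2000, n_theta=64):
    """Midpoint-rule integral of f(r, theta) over [0, r_max] x [0, 2*pi]."""
    dr = r_max / n_r
    dtheta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            total += f(r, theta) * dr * dtheta
    return total


# Only the version that includes the Jacobian factor is normalised in (r, theta).
with_jacobian = integrate_polar(p_rtheta)
without_jacobian = integrate_polar(p_xy_at_polar)
print(with_jacobian, without_jacobian)
```

Without the factor of r the integral comes out near sqrt(pi/2) rather than 1, so writing both functions as "p(r, theta)" hides a real difference.
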

IMHO you are absolutely right. I would also add that it mixes reading from left to right with reading from right to left (think: P(b|a) P(a)). Personally, I really dislike that it set the trend of writing:

v_after = A_3 A_2 A_1 v

not (what I would consider more straightforward)

v_after = v A_1 A_2 A_3
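
To make the two orderings concrete, here is a minimal sketch in plain Python. The matrices, the vector, and the helpers `matmul` and `transpose` are made up for illustration. One assumption to flag: the left-to-right form only gives the same answer if v is treated as a row vector and each matrix is transposed, which is the usual row-vector convention rather than something stated above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


# Hypothetical transformations, applied in the order A1, then A2, then A3.
A1 = [[1, 2], [0, 1]]
A2 = [[0, -1], [1, 0]]
A3 = [[2, 0], [0, 3]]
v = [[1], [1]]  # column vector

# Column-vector convention: the first transformation is written rightmost.
column_result = matmul(A3, matmul(A2, matmul(A1, v)))

# Row-vector convention: reads left to right, using transposed matrices.
row_v = transpose(v)
row_result = matmul(
    matmul(matmul(row_v, transpose(A1)), transpose(A2)), transpose(A3)
)

print(column_result, row_result)
```

Both conventions describe the same result; they differ only in whether the order of application reads right to left or left to right.
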

Glad I am not the only one finding this extremely frustrating. At least mathematicians tend to be much more explicit than engineers. Unfortunately not every topic which uses probability has a textbook written by a mathematician available.