| HN Mirror

Sorry to respond so late.

I think it will be easiest for you to diagnose what you're misunderstanding if I present a simple discrete case.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Consider a coin flip. 50% chance heads, 50% chance tails. This distribution is called Bernoulli, specifically Bernoulli(0.5). If we sample from this distribution, we get 1 (representing heads) with a 50% probability, or 0 (representing tails) with a 50% probability.

Now consider taking two samples, and calculating the likelihood of those two samples. Suppose we draw two samples from this distribution, HT (heads followed by tails). What is the probability that we got exactly these two samples from the distribution? Trivially, it's 0.5 * 0.5 = 0.25. Notice how this isn't the same as the probability of drawing any single sample (the probability of drawing any particular sample, that is, either heads or tails, is just 0.5).

The distribution representing the probability of a single sample of a coin flip lies in {0, 1}. You can think of this as a single-dimensional table, [0.5, 0.5], where each element represents the probability of the sample taking on the index of that element. (the probability of the sample taking on the value 0, which represents tails, is the 0th element of this array, 0.5. Similarly for 1, which represents heads).

Now think of the distribution for two samples. There are no longer two possibilities, but four - {0, 1} x {0, 1} = {(0, 0), (0, 1), (1, 0), (1, 1)} = {TT, TH, HT, HH}. We think of this as not a one-dimensional table but a two-dimensional table:

  second sample = 0(T) [[0.25,              0.25],
  second sample = 1(H) [0.25,               0.25]]
                       first sample = 0(T), first sample = 1(H)

Here, the element at row i and column j represents the probability that the first sample takes on value j and the second sample takes on value i.

For three samples, the distribution becomes three-dimensional, with the space of possibilities being {0,1}^3 = {(0, 0, 0), (0, 0, 1), (0, 1, 0), ...}.

For any of these tables, each element represents the probability that a sample takes on the corresponding value at the element's position. So, clearly, if we add up all the values in a table, no matter how many dimensions, it must sum up to 1. There is a 100% chance that a sequence of n samples takes on some value, after all.

What you're saying about drawing multiple samples approximating the underlying PDF is still true here (though we are not talking about the PDF in the discrete case, but rather the PMF - probability mass function - since each element in this table is actually a probability, not merely a measure of density). If you draw N samples from this distribution and plot it on a histogram (one bar for the number of heads you draw divided by N, one bar for the number of tails you drew divided by N), then this will approximate the underlying PMF, namely [0.5, 0.5]. But that is separate from the fact that drawing a particular sequence of N samples becomes decreasingly smaller as N increases. For N = 2, the probability of drawing any two particular samples (TT, TH, HT, or TT) is (1/2)^2. In general, it is (1/2)^N. One way to think about why it is (1/2)^N is that the distribution for N sample lies on the space {0, 1}^N, whose size is 2^N. The total probability, which is always 1, (no matter how large N is, it's still true that there's a 100% chance that a sequence of N samples is some sequence), needs to be distributed across the space 2^N. Every possibility is equally likely, so it's evenly distributed, so the probability is 1 / 2^N = (1/2)^N.

The same idea roughly applies in the continuous case, but importantly, in the continuous case, we're no longer talking about raw probabilities for a particular sample (the probability of drawing the value 0.3 from the distribution Uniform(0, 1) is exactly 0), but we're talking about probability density values. The same principle still applies though - if we "sum" (integrate) up all the PDF values for a distribution, since the PDF is the derivative of the CDF, by the fundamental theorem of calculus, we still should get 1. (The PDF is p(x) = d/dx P(X < x). Integrating both sides across all possible values X can take on, we will get integral from min possible value of x to max possible value of x of p(x) dx = P(X < max possible value of x) - P(X < min possible value of x) = 1 - 0 = 1). This total probability, 1, needs to be distributed across some space. The bigger the space is, the less densely it's going to be distributed, which is reflected in the lower value of the PDF.

Note:

To be sure, the space getting higher dimensional doesn't necessarily mean the PDF must be less dense. Consider Uniform(0, 0.5). When looking at the likelihood of two samples being from Uniform(0, 0.5), the probability is spread across [0, 0.5] x [0, 0.5], whose area is 0.50.5 = 0.25. Since the area is less than 1, the probability is actually more* dense---specifically, the PDF is 4 at any point in [0, 0.5] x [0, 0.5], but for just one sample, the PDF is 2 at any point in [0, 0.5]. Whether the probability gets less or more dense in the higher dimensional space representing the likelihood of multiple samples from the same distribution depends on the volume of the domain. For Uniform(0, 2), the space of two samples is [0,2] x [0,2], whose area is 2 * 2 = 4---this is larger than it is for just one sample, since the space for just one sample is [0,2], which covers 2 units of space. Accordingly, for just one sample, the PDF is 0.5, while for two samples, the PDF is 0.25. The larger this space, the less dense the probability is concentrated in this space, and vice versa. If we're thinking about uniform distributions, notice the space gets bigger for Uniform(0, L) if L > 1, and gets smaller for L < 1 since powers of L (representing the size of higher dimensional spaces, e.g. L^2 represents the size of [0, L] x [0, L], the space on which the PDF for two samples from the distribution must lie) get smaller if L < 1 but get bigger if L > 1. For the stack overflow post, the distribution in question is Gaussian, which takes on positive values on (-infinity, infinity), which you can think of as being more than large enough for the size of higher-dimensional spaces to be increasing, hence causing the PDF values to become smaller and smaller.

I hope I didn't make the problem more confusing by saying all that. If you're still confused, I can try to clear up things further, or I can point you to a better resource if you'd prefer that.