| HN Mirror


by fergal_reid 850 days ago

I think most of the replies, here and on stack exchange, are answering slightly the wrong question.

It is fair to ask why the likelihoods are useful if they are so small, and it's not a good answer to talk about how they could be expressed as logs, or even to talk about the properties of continuous distributions.

I think the answer is:

Yes, individual likelihoods are so small that even an MLE solution is extremely unlikely to be correct.

However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

Much like how the average is unlikely to be the exact value of a new sample from the distribution, but it's a good way of describing what to expect. (And gets better if you augment it with some measure of dispersion, and so on). (If the distribution is very dispersed, then while the average is less useful as an idea of what to expect, it still minimises prediction error in some loss; but that's a different thing and I think less relevant here).

6 comments

bscphil 850 days ago

> It is fair to ask why the likelihoods are useful if they are so small

The way the question demonstrates "smallness" is wrong, however. They quote the product of the likelihoods of 50 randomly sampled values - 9.183016e-65 - as if the smallness of this value were significant or meant anything at all. Forget the issue of continuous sampling from a normal distribution, and just consider the simple discrete case of flipping a coin. The probability of any particular sequence of 50 flips is 0.5 ^ 50, a really small number. That's because the probability is, in fact, really small!
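A minimal sketch of both cases, assuming a fair coin and, for the continuous case, the mean 5 and SD 5 mentioned elsewhere in this thread. The seed and the helper `normal_pdf` are illustrative choices, and the code does not try to reproduce the exact figure quoted in the question.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Discrete case: every particular sequence of 50 fair flips is equally (un)likely.
p_sequence = 0.5 ** 50

# Continuous case: the product of 50 densities is tiny even at the true parameters.
rng = random.Random(0)  # arbitrary seed, for repeatability only
samples = [rng.gauss(5, 5) for _ in range(50)]
likelihood = math.prod(normal_pdf(x, 5, 5) for x in samples)

# The log-likelihood carries the same information on a friendlier scale.
log_likelihood = sum(math.log(normal_pdf(x, 5, 5)) for x in samples)

print(p_sequence, likelihood, log_likelihood)
```

Neither number is small because the model is bad; each is small because it multiplies 50 factors that are all below 1.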

knightoffaith 850 days ago

Right - and so the more appropriate thing to do is not look at the raw likelihood of any one particular value but instead look at relative likelihoods to understand what values are more likely than other values.

adiM 850 days ago

Therefore, likelihood ratios! (Or log likelihood ratios)
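As a rough illustration of a log likelihood ratio, the sketch below compares the sample mean against an arbitrary alternative mean of 0 for simulated normal data. The SD is held fixed at 5 to keep the arithmetic simple; the seed, the alternative value, and the function names are all assumptions made for this example.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def log_likelihood(data, mu, sigma):
    return sum(normal_logpdf(x, mu, sigma) for x in data)

rng = random.Random(1)  # arbitrary seed
data = [rng.gauss(5, 5) for _ in range(50)]

sigma = 5.0                      # held fixed for simplicity
mu_hat = sum(data) / len(data)   # MLE of the mean when sigma is fixed
mu_alt = 0.0                     # arbitrary alternative to compare against

ll_hat = log_likelihood(data, mu_hat, sigma)
ll_alt = log_likelihood(data, mu_alt, sigma)

# Both log-likelihoods are large negative numbers; only their difference matters.
log_ratio = ll_hat - ll_alt
print(ll_hat, ll_alt, log_ratio)
```

With the SD fixed, the difference works out to n * (mu_hat - mu_alt)^2 / (2 * sigma^2), so it grows with the sample size even though each individual likelihood shrinks.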

anon946 850 days ago

For the discrete case, it seems that a better thing to do is consider the likelihood of getting that number of heads, rather than the likelihood of getting that exact sequence.

I am not sure how to handle the continuous case, however.
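For the discrete case, the difference between an exact sequence and a count of heads can be checked directly. This is a small sketch assuming 50 fair flips; the 20-to-30 window is just an illustrative range.

```python
import math

n = 50

# Probability of one specific ordered sequence of 50 fair flips.
p_exact_sequence = 0.5 ** n

def p_heads(k, n=n):
    """Probability of exactly k heads in n fair flips (binomial)."""
    return math.comb(n, k) * 0.5 ** n

p_25 = p_heads(25)                                       # single most likely count
p_20_to_30 = sum(p_heads(k) for k in range(20, 31))      # a range of counts

print(p_exact_sequence, p_25, p_20_to_30)
```

Collapsing the orderings already turns a probability near 1e-15 into one near 0.11, and summing over a range of counts brings it close to 0.9.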

lupire 850 days ago

Of course you ignore irrelevant ordering of data points. That's not the issue.

The issue, for discrete or continuous (which are mathematically approximations of each other), is that the value at a point is less important than the integral over a range. That's why standard deviation is useful. The argmax is a convenient average over a weightable range of values. The larger your range, the greater the likelihood that the "truth" is in that range.

If you only need to be correct up to 1% tolerance, the likelihood of a range of values that have $SAMPLING_PRECISION tolerance is not important. Only the argmax is, to give you a center of the range.

jvanderbot 850 days ago

Yes - the most enlightening concept for me was "Highest Probability Density Interval" which basically always is clustered around the mean. But you can choose any interval which contains as much probability mass!

https://en.wikipedia.org/wiki/Credible_interval#Choosing_a_c...

It's a fairly common "mistake" to assume that the MLE is useful as a point estimate without considering covariance/spread/CI/HPDI/FIM/CRLB/Entropy/MI/KLD or some other measure of precision given the measurement set.

TobyTheCamel 850 days ago

> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This may be true for low dimensions but doesn’t generalise to high dimensions. Consider a 100-dimensional standard normal distribution for example. The MLE will still be at the origin but most of the mass will live in a thin shell of distance roughly 10 units (the square root of the dimension) from the origin.
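Where the mass sits can be checked by simulation. The distance from the origin of a d-dimensional standard normal draw follows a chi distribution with d degrees of freedom, whose mean is close to sqrt(d), so about 10 for d = 100. The sample count and seed below are arbitrary choices.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed
dim = 100
n_points = 2000

norms = []
for _ in range(n_points):
    point = [rng.gauss(0, 1) for _ in range(dim)]
    norms.append(math.sqrt(sum(c * c for c in point)))

mean_norm = sum(norms) / len(norms)
closest = min(norms)
farthest = max(norms)

# Fraction of draws that land within distance 1 of the density peak at the origin.
frac_near_origin = sum(r < 1.0 for r in norms) / len(norms)

print(mean_norm, closest, farthest, frac_near_origin)
```

The density is highest at the origin, yet essentially no draws land near it, because the volume available at radius r grows like r^(d-1).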

blt 850 days ago

I think the "mass" they are referring to might be the mass of the Bayesian posterior in parameter space, not the mass of the data in event space.

fergal_reid 850 days ago

Yes, in parameter space.

However, TobyTheCamel's point is valid in that there are some parameter spaces where the MLE is going to be much less useful than others.

Even without having to go to high dimensions, if you've got a posterior that looks like a normal distribution, the MLE is going to tell you a lot, whereas if it's a multimodal distribution with a lot of mass scattered around, knowing the MLE is much less informative.

But this is a complex topic to address in general, so I'm trying to stick to what I see as the intuition behind the original question!

lupire 850 days ago

Concentration of mass is density. A shell is not dense.

If I am looking for a needle in a hyperhaystack, it's not important to know that it's more likely to be "somewhere on the huge hyperboundary" than "in the center hypercubic inch".

zmgsabst 850 days ago

Disagree:

A lot of why large corporations fail to make products that people enjoy is tied up in this behavior: mass is not independently distributed along each dimension, so you end up with “continents of taste” that your centroid product sucks for equally.

astrange 850 days ago

This is similar to how they originally tried to build fighter jet seats for the average pilot, but it failed because it turned out there were no average pilots, so they had to make them adjustable.

kgwgk 850 days ago

And yet your parent comment was right in saying that it won't be true that "a lot of the probability mass - an amount that is not small - will be concentrated" in the center hypercubic inch.

crazygringo 850 days ago

> Yes, individual likelihoods are so small that even an MLE solution is extremely unlikely to be correct.

Can you elaborate? An MLE is never going to come up with the exact parameters that produced the samples, but in the original example, as long as you know it's a normal distribution, MLE is probably going to come up with a mean between 4 and 6 and a SD within a similar range as well (I haven't calculated it, just eyeballing it) -- when the original parameters were 5 and 5.

I guess I don't know what you mean by "correct", but that's as correct as you can get, based on just 50 samples.
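The eyeballed claim can be tried out with a quick simulation. This sketch assumes the data really are normal with mean 5 and SD 5 and uses the closed-form normal MLEs (sample mean, and SD with divisor n); the seed is arbitrary and other seeds will give somewhat different estimates.

```python
import math
import random

rng = random.Random(42)  # arbitrary seed
true_mean, true_sd = 5.0, 5.0
samples = [rng.gauss(true_mean, true_sd) for _ in range(50)]
n = len(samples)

# Closed-form maximum likelihood estimates for a normal model.
mle_mean = sum(samples) / n
mle_sd = math.sqrt(sum((x - mle_mean) ** 2 for x in samples) / n)  # divisor n, not n - 1

print(mle_mean, mle_sd)
```

The estimates will almost never equal 5 and 5 exactly, but with 50 samples the standard error of the mean is about 5 / sqrt(50), roughly 0.7, so they usually land nearby.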

fergal_reid 850 days ago

Right - I think this is what's at the heart of the original question.

I know they asked with a continuous example, but I don't interpret their question as limited to continuous cases, and I think it's easier to address using a discrete example, as we avoid the issue of each exact parameter having infinitesimal mass which occurs in a continuous setting.

Let's imagine the parameter we're trying to estimate is discrete and has, say, 500 different possible values.

Let's say the parameter can take any integer value between 1 and 500 and most of the mass is clustered in the middle between 230 and 270.

Given some data, it would actually be possible that MLE would come up with the exact value, say 250.

But maybe given the data, a range of values between 240 and 260 is also very plausible, so the exact value 250 has a fairly low probability.

The original poster is confused, because they are basically saying, well, if the actual probability is so low, why is this MLE stuff useful?

You are pointing out they should really frame things in terms of a range and not a point estimate. You are right; but I think their question is still legitimate, because often in practice we do not give a range, and just give the maximum likelihood estimate of the parameter. (And also, separately, in a discrete parameter setting, a specific parameter value could have substantial mass.)

So why is the MLE useful?

My answer would be, well, that's because for many posterior distributions, a lot of the probability mass will be near the MLE, if not exactly at it - so knowing the MLE is often useful, even if the probability of that exact value of the parameter is low.
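The discrete example can be made concrete with a made-up posterior. The bell-shaped weights centred at 250 with a spread of 8 are purely hypothetical, chosen so that most of the mass falls between 230 and 270 as described above.

```python
import math

values = range(1, 501)

# Hypothetical posterior: a discretised bell curve centred at 250 (spread 8).
weights = [math.exp(-0.5 * ((v - 250) / 8) ** 2) for v in values]
total = sum(weights)
posterior = {v: w / total for v, w in zip(values, weights)}

# The single most probable parameter value.
mode = max(posterior, key=posterior.get)
p_mode = posterior[mode]

# Mass in neighbourhoods around that value.
p_240_to_260 = sum(posterior[v] for v in range(240, 261))
p_230_to_270 = sum(posterior[v] for v in range(230, 271))

print(mode, p_mode, p_240_to_260, p_230_to_270)
```

Here the best single value has only about a 5% chance of being the true one, while the range 240 to 260 around it holds roughly 80% of the mass.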

aquafox 850 days ago

I agree with your points, and that's why it's useful to compare an MLE to an alternative model via a likelihood ratio test, in which case one sees how much better the generative model performs compared to the wrong model.

Similarly, AIC values do not make a lot of sense on an absolute scale but only relative to each other, as written in [1].

[1] Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological methods & research, 33(2), 261-304.

agnosticmantis 850 days ago

> However, the idea is that often a lot of the probability mass - an amount that is not small - will be concentrated around the maximum likelihood estimate, and so that's why it makes a good estimate, and worth using.

This is a Bayesian point of view. The other answers are more frequentist, pointing out that likelihood at a parameter theta is NOT the probability of theta being the true parameter (given data). So we can't and don't interpret it like a probability.

klipt 850 days ago

Given enough data, Bayesian and frequentist models tend to converge to the same answer anyway.

Bayesian priors have similar effect to regularization (e.g. ridge regression / penalizing large parameter values).
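A one-parameter sketch of both points, assuming a normal likelihood with known SD `sigma` and a zero-mean normal prior with SD `tau` on the mean (both values are hypothetical). Under those assumptions the MAP estimate equals the minimiser of a ridge-penalised sum of squares with penalty `sigma**2 / tau**2`.

```python
import random

rng = random.Random(3)  # arbitrary seed
sigma = 5.0   # assumed known noise SD
tau = 2.0     # hypothetical prior SD on the mean (prior centred at 0)

def mle_and_map(data):
    """Return (MLE, MAP) for the mean of normal data under the prior above."""
    n = len(data)
    penalty = sigma ** 2 / tau ** 2      # ridge penalty implied by the prior
    mle = sum(data) / n
    # Minimiser of sum((x - mu)^2) + penalty * mu^2, which is also the MAP.
    map_estimate = sum(data) / (n + penalty)
    return mle, map_estimate

small = [rng.gauss(5, sigma) for _ in range(10)]
large = small + [rng.gauss(5, sigma) for _ in range(9990)]

mle_small, map_small = mle_and_map(small)
mle_large, map_large = mle_and_map(large)

gap_small = abs(mle_small - map_small)
gap_large = abs(mle_large - map_large)
print(gap_small, gap_large)
```

The prior pulls the estimate toward 0 noticeably with 10 points and hardly at all with 10,000, since the penalty is fixed while n grows.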

LudwigNagasena 850 days ago

That's not a Bayesian point of view. You can re-word it in terms of a confidence interval / coverage probability. It is true that in frequentist statistics parameters don't have probability distributions, but their estimators very much do. And one of the main properties of a good estimator is formulated in terms of convergence in probability to the true parameter value (consistency).
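Coverage probability can be illustrated by simulation. This sketch assumes normal data with a known SD of 5 and uses the textbook 95% z-interval for the mean; the repetition count and seed are arbitrary.

```python
import math
import random

rng = random.Random(7)  # arbitrary seed
true_mean, sd, n = 5.0, 5.0, 50
half_width = 1.96 * sd / math.sqrt(n)   # 95% z-interval, SD assumed known

repetitions = 2000
covered = 0
for _ in range(repetitions):
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    estimate = sum(sample) / n          # the estimator is random; the parameter is not
    if estimate - half_width <= true_mean <= estimate + half_width:
        covered += 1

coverage = covered / repetitions
print(coverage)
```

The probability statement is about the interval produced by the estimator over repeated samples, not about the parameter, which is the frequentist reading described above.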