Hacker News
by kjhcvkek77 859 days ago
Because it works well in practice. And to elaborate, usually when something works well in practice it's because it has multiple desirable properties - the one you "ask for", but also other ones you get for free.

In this case, maximum likelihood approximates Bayesian estimation with a mostly reasonable prior. Furthermore, you could look at the convergence properties, which are good.
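A rough sketch of the "approximates Bayesian estimation" point, assuming a coin-flip model and a grid search (the data, the grid, and the Beta(2, 2) prior are all made up for illustration): with a flat prior the posterior mode is the maximum likelihood estimate, and with a mild non-flat prior it lands close by.

```python
import math

def log_likelihood(p, heads, flips):
    """Log-likelihood of a coin with bias p after observing `heads` in `flips`."""
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

# Hypothetical data and a grid of candidate biases (both are assumptions).
heads, flips = 7, 10
grid = [i / 1000 for i in range(1, 1000)]

# Maximum likelihood estimate: argmax of the likelihood alone.
mle = max(grid, key=lambda p: log_likelihood(p, heads, flips))

# Posterior mode under a flat prior: the log-prior is a constant, so the argmax is unchanged.
flat_map = max(grid, key=lambda p: log_likelihood(p, heads, flips) + math.log(1.0))

# Posterior mode under a Beta(2, 2) prior, whose density is proportional to p * (1 - p).
beta_map = max(grid, key=lambda p: log_likelihood(p, heads, flips) + math.log(p * (1 - p)))

print(mle, flat_map, beta_map)
```

With more flips at the same ratio, the Beta(2, 2) mode moves toward the maximum likelihood estimate, which is the convergence behaviour mentioned above.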

You could probably design some degenerate probability distribution that ml-estimation behaves really badly for, but those are not common in practice.

2 comments

It's better than "it works well in practice".

The question is misguided as stated. It's like asking why chemists care about density for measuring mass.

If you are looking at the likelihood of any particular outcome of a continuous random variable, then you do not understand how probability works.

The probability of any particular real number arising from a probability distribution on the real numbers is exactly 0. It's not an arbitrarily small epsilon greater than zero, it's actually zero. This definition is in fact required for probability to make sense mathematically.

You might ask questions like why does maximum likelihood work as an optimization criterion, but that's very different from asking why we care about likelihood at all.

The comments on the original question do a good job of cutting through this confusion.

I appreciate your response but I don't really agree. They say that likelihood can be multiplied by any scale factor or that it's only the comparative difference that matters, or we can make a little plot, but they don't actually explain why.

I can try to make an explanation from the Bayesian framework (but as I mentioned, it's not the only relevant one).

Likelihood is P(measurement=measurement'|parameter=parameter'). This is a small value. Given a prior, we can compute P(parameter=parameter'|measurement=measurement'). This is also small. But when we compute P(parameter'-k<parameter<parameter'+k|measurement=measurement'), all the smallness cancels; see the formulation of Bayes' rule that reads

P(X_i|Y) = P(X_i)P(Y|X_i) / (sum_j P(X_j)P(Y|X_j))

I'm obviously skipping a lot of steps here because I'm sketching an explanation rather than giving one.
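To make the cancellation a bit more concrete, here is a minimal sketch of the discrete Bayes formula on a grid of parameter values. It assumes Gaussian measurement noise with unit standard deviation, a flat prior, and a made-up measurement of 1.3; `normal_pdf` and `grid_posterior` are hypothetical helper names. Multiplying the likelihood by any constant leaves the posterior untouched, and the mass on an interval around the measurement is not small at all.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Normal density; viewed as a function of `mean`, this is the likelihood."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def grid_posterior(measurement, grid, prior, scale=1.0):
    """P(X_i|Y) = P(X_i)P(Y|X_i) / sum_j P(X_j)P(Y|X_j), with an optional likelihood scale."""
    weights = [p * scale * normal_pdf(measurement, mean=g) for g, p in zip(grid, prior)]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(-500, 501)]  # candidate parameter values (assumed range)
prior = [1 / len(grid)] * len(grid)         # flat prior (an assumption)
measurement = 1.3                           # made-up observation

post = grid_posterior(measurement, grid, prior)
post_scaled = grid_posterior(measurement, grid, prior, scale=1e-12)

# Largest difference between the two posteriors: the scale factor cancels.
max_diff = max(abs(a - b) for a, b in zip(post, post_scaled))

# Posterior mass on parameter' - k < parameter < parameter' + k.
k = 0.5
mass = sum(p for g, p in zip(grid, post) if abs(g - measurement) < k)

print(max_diff, mass)
```

Each individual grid point gets a tiny posterior weight, but the interval collects a sizeable share of the total, because the same small factors appear in both the numerator and the denominator.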

> The probability of any particular real number arising from a probability distribution on the real numbers is exactly 0. It's not an arbitrarily small epsilon greater than zero, it's actually zero.

Nitpicking somewhat, but e.g. `max(1, uniform(0, 2))` has a very non-zero probability of evaluating to 1.
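A quick simulation of that example, using Python's `random.uniform` as a stand-in for `uniform` (the seed and sample size are arbitrary choices): roughly half of the draws come out as exactly 1.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable
n = 100_000

# Every draw at or below 1 gets clamped to exactly 1 by max().
samples = [max(1, random.uniform(0, 2)) for _ in range(n)]

# Fraction of samples that are exactly equal to 1.
frac_one = sum(s == 1 for s in samples) / n
print(frac_one)
```

The distribution is a mix of a point mass at 1 and a continuous part on (1, 2), so it has no density in the usual sense at 1.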

> You could probably design some degenerate probability distribution that ml-estimation behaves really badly for, but those are not common in practice.

Anything multimodal...